# Preprocessing

Dataset: Quotes-500K (https://github.com/ShivaliGoel/Quotes-500K)

Download links: 
- https://www.mediafire.com/file/elzplaxcnpf91qs/quotes.csv/file
- https://drive.google.com/file/d/1M5TTqsLw7uVZfJfvH-fNpRxL9Mw0Iky8/view?usp=sharing
- https://archive.org/details/quotes_20230625

In [30]:
import pandas as pd
import re

In [31]:
quotes_df = pd.read_csv('quotes.csv')
quotes_df

Unnamed: 0,quote,author,category
0,"I'm selfish, impatient and a little insecure. ...",Marilyn Monroe,"attributed-no-source, best, life, love, mistak..."
1,You've gotta dance like there's nobody watchin...,William W. Purkey,"dance, heaven, hurt, inspirational, life, love..."
2,You know you're in love when you can't fall as...,Dr. Seuss,"attributed-no-source, dreams, love, reality, s..."
3,A friend is someone who knows all about you an...,Elbert Hubbard,"friend, friendship, knowledge, love"
4,Darkness cannot drive out darkness: only light...,"Martin Luther King Jr., A Testament of Hope: T...","darkness, drive-out, hate, inspirational, ligh..."
...,...,...,...
499704,I do believe the most important thing I can do...,John C. Stennis,"Past, Believe, Help"
499705,I'd say I'm a bit antimadridista although I do...,Isco,"Team, Humility, Know"
499706,The future is now.,Nam June Paik,Now
499707,"In all my life and in the future, I will alway...",Norodom Sihamoni,"Life, My Life, Servant"


In [32]:
# Remove rows where quote is empty
quotes_df = quotes_df.dropna(subset=['quote'])

# Rows with a null 'author' or 'category' are kept; fill the gaps with 'Unknown'
quotes_df = quotes_df.fillna('Unknown')
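
To make the effect of the two calls above concrete, here is a minimal sketch on a toy frame. The frame `toy_df` and its values are made up for illustration; only the column names match the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with the same three columns as quotes.csv
toy_df = pd.DataFrame({
    'quote': ['The future is now.', np.nan, 'Stay curious.'],
    'author': ['Nam June Paik', 'Somebody', np.nan],
    'category': ['Now', 'misc', np.nan],
})

# Rows without a quote are dropped; remaining gaps become 'Unknown'
toy_clean = toy_df.dropna(subset=['quote']).fillna('Unknown')
print(toy_clean)
```

The row with the missing quote disappears, while the row with a missing author and category survives with placeholder values.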

In [33]:
# Some entries in 'author' have NAME, BOOK TITLE (separated by comma): just get the NAME and discard BOOK TITLE
quotes_df['author'] = quotes_df['author'].apply(lambda x: x.split(',')[0] if isinstance(x, str) else x)
# Replace multiple spaces with single space
quotes_df['author'] = quotes_df['author'].apply(lambda x: re.sub(r' +', r' ', x) if isinstance(x, str) else x)
quotes_df['author'] = quotes_df['author'].str.strip()

def get_mismatched_rows(df):
    """
    Returns the rows of the dataframe where values were entered into the wrong
    columns among 'quote', 'author' and 'category'. Typically, part of the
    quote ends up in 'author' and the author ends up in 'category'.

    We keep rows where the 'author' value is more than 5 words long and does
    not start with a capital letter. Such a value is unlikely to be a real
    author, so we presume it is the continuation of the text in 'quote'. This
    justifies concatenating the 'quote' and 'author' values after calling this
    function.

    Note: adds the helper columns 'author_len' and 'author_caps' to df in place.
    """
    df['author_len'] = df['author'].apply(lambda x: len(x.split(' ')))
    df['author_caps'] = df['author'].apply(lambda x: x[0].isupper() if x != '' else False)

    # Return a copy so the caller can edit the rows without a SettingWithCopyWarning
    rows_to_fix = df[(df['author_len'] > 5) & ~df['author_caps']].copy()
    return rows_to_fix
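
The heuristic is easier to judge when applied to single strings. The sketch below restates the same rule as a standalone predicate; the name `looks_like_quote_fragment` and the example list are hypothetical, though the strings are taken from outputs elsewhere in this notebook.

```python
def looks_like_quote_fragment(author: str) -> bool:
    """Same rule as get_mismatched_rows, applied to one 'author' string."""
    n_words = len(author.split(' '))
    starts_upper = author[0].isupper() if author != '' else False
    return n_words > 5 and not starts_upper

examples = [
    'Marilyn Monroe',                          # short, capitalised: a real author
    'traverse Jungle and Desert to find you',  # long, lowercase start: flagged
    'I want to live on in my apartment.',      # long, but capitalised: not flagged
    '',                                        # empty string: not flagged
]
flags = [looks_like_quote_fragment(a) for a in examples]
print(flags)
```

The third example shows the rule's blind spot: a quote fragment that happens to start with a capital letter passes as an author.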

In [34]:
rows_to_fix = get_mismatched_rows(quotes_df)

# Concatenate the 'quote' and 'author' values
for index, row in rows_to_fix.iterrows():
    # Concatenate the 'author' value to the 'quote' and update 'author' with 'category'
    rows_to_fix.at[index, 'quote'] = row['quote'] + " " + row['author']
    rows_to_fix.at[index, 'author'] = row['category']
    rows_to_fix.at[index, 'category'] = ' '

rows_fixed = rows_to_fix
rows_fixed.head()

Unnamed: 0,quote,author,category,author_len,author_caps
807,If you love something so much let it go. If it...,Albert Schweitzer,,6,False
1236,Submission is not about authority and it is no...,"William Paul Young, The Shack",,9,False
1917,Emotion without reason lets people walk all ov...,"Nalini Singh, Archangel's Kiss",,8,False
2730,We all are so deeply interconnected we have no...,"Amit Ray, Yoga and Vipassana: An Integrated Li...",,34,False
2882,I'd run my whole life long to reach you paddle...,traverse Jungle and Desert to find you,,7,False
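
The `iterrows` loop above works, but the same shift of values can be written with a boolean mask and `.loc`, which tends to be faster on large frames. This is a sketch on a made-up two-row frame, not the code that produced the output above; `toy_df` and 'Hypothetical Author' are illustrative.

```python
import pandas as pd

# Hypothetical toy frame: the second row has the quote spilling into 'author'
toy_df = pd.DataFrame({
    'quote': ['The future is now.', 'Give a man a proverb'],
    'author': ['Nam June Paik', "and he'll muse for a moment longer"],
    'category': ['Now', 'Hypothetical Author'],
})

# Same rule as get_mismatched_rows: more than 5 words, no leading capital
words = toy_df['author'].str.split(' ').str.len()
starts_upper = toy_df['author'].str[:1].str.isupper()
mask = (words > 5) & ~starts_upper

# Shift the values one column to the left for the flagged rows only
fixed = toy_df.copy()
fixed.loc[mask, 'quote'] = toy_df.loc[mask, 'quote'] + ' ' + toy_df.loc[mask, 'author']
fixed.loc[mask, 'author'] = toy_df.loc[mask, 'category']
fixed.loc[mask, 'category'] = ' '
print(fixed)
```

Reading the right-hand sides from the untouched `toy_df` avoids any dependence on the order of the three assignments.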


In [35]:
# Replace the incorrect rows with the newly fixed rows
cleaned_df = pd.concat([quotes_df.drop(rows_fixed.index), rows_fixed])

get_mismatched_rows(cleaned_df)
# What remains are rows where the quote was split across all three columns,
# so the author is unknown. As before, we append the 'author' value to
# 'quote'.

Unnamed: 0,quote,author,category,author_len,author_caps
2882,I'd run my whole life long to reach you paddle...,traverse Jungle and Desert to find you,,8,False
3151,The real things haven't changed. It is still b...,to be happy with simple pleasures,,7,False
3423,Do not let arrogance go to your head and despa...,do not let success go to your head and failur...,,14,False
4051,You don't ask people with knives in their stom...,it's all about whether you pull the knife out...,,18,False
4685,This new day has greeted us with no rules unco...,with open arms and endless possibility.,,7,False
...,...,...,...,...,...
393050,Give a man a proverb and he’ll muse for a mome...,and he’ll walk in wisdom for a lifetime.,,8,False
393701,Good leaders are intelligent great leaders are...,great leaders are fearless.Good leaders are ar...,,7,False
394119,It is impossible to catch today it flows out f...,it flows out from our palms! We cannot hold them,,11,False
423426,I hate when people play politics with me when ...,karma is still treating me like a bitch. Some...,,13,False


In [36]:
rows_to_fix = get_mismatched_rows(cleaned_df)

# Concatenate the 'quote' and 'author' values
for index, row in rows_to_fix.iterrows():
    # Concatenate the 'author' value to the 'quote' and update 'author' with 'category'
    rows_to_fix.at[index, 'quote'] = row['quote'] + " " + row['author']
    rows_to_fix.at[index, 'author'] = row['category']
    rows_to_fix.at[index, 'category'] = ' '

rows_fixed = rows_to_fix
rows_fixed.head()

Unnamed: 0,quote,author,category,author_len,author_caps
2882,I'd run my whole life long to reach you paddle...,,,8,False
3151,The real things haven't changed. It is still b...,,,7,False
3423,Do not let arrogance go to your head and despa...,,,14,False
4051,You don't ask people with knives in their stom...,,,18,False
4685,This new day has greeted us with no rules unco...,,,7,False


In [37]:
# Replace the incorrect rows with the newly fixed rows
cleaned_df = pd.concat([cleaned_df.drop(rows_fixed.index), rows_fixed])

In [38]:
get_mismatched_rows(cleaned_df)
# No rows match the heuristic in get_mismatched_rows any more!

Unnamed: 0,quote,author,category,author_len,author_caps


In [39]:
# Replace all empty strings with whitespace before converting to CSV.
# Otherwise, when we read the CSV back in, the value will be registered as NULL.
cleaned_df = cleaned_df.replace('', ' ')

# Remove indicator columns we no longer need
output_df = cleaned_df.drop(['author_len', 'author_caps'], axis=1)

# Export
output_df.to_csv('quotes_clean.csv', index=False)
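
The reason for swapping `''` for `' '` can be checked with a small round trip through an in-memory CSV. This sketch assumes the default `read_csv` settings; the frame is made up.

```python
import io
import pandas as pd

# Hypothetical frame: one empty-string author, one single-space author
toy_df = pd.DataFrame({
    'quote': ['first quote', 'second quote'],
    'author': ['', ' '],
})

# Write to an in-memory buffer and read it straight back
buffer = io.StringIO()
toy_df.to_csv(buffer, index=False)
buffer.seek(0)
round_trip = pd.read_csv(buffer)

# True where the value came back as NaN
print(round_trip['author'].isna().tolist())
```

With default settings, the empty string should come back as NaN while the single space survives as a string.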

### An aside...

In [40]:
# There are still roughly 2,500 rows where the quote is split. These are more
# annoying to handle because the 'author' value starts with a capital letter
# but is longer than 5 words, so it could be a quote fragment, an author, or
# an author + book title. That is only about 0.5% of our ~500,000-quote
# dataset, so not worth the effort!
cleaned_df[cleaned_df['author_len'] > 5]

Unnamed: 0,quote,author,category,author_len,author_caps
2757,Don't touch me,I'll die if you touch me.,"Vladimir Nabokov, Lolita",6,True
2885,Just in case you ever foolishly forget,I'm never not thinking of you.,"Virginia Woolf, Selected Diaries",6,True
2918,I'm not afraid of death,I just don't want to be there when it happens.,Woody Allen,10,True
3418,I don't want to achieve immortality through my...,I want to achieve immortality through not dyin...,I want to live on in my apartment.,20,True
3965,Life is but a day,A fragile dew-drop on its perilous wayFrom a t...,"John Keats, The Complete Poems",10,True
...,...,...,...,...,...
423387,Toys can be anything children can play all mor...,"Iben Dissing Sandahl, Play The Danish Way: A G...",,18,True
423441,Our play is not something separate from our sp...,"Ken Shigematsu, God in My Everything: How an A...",,15,True
423917,A right doesn't include the material implement...,"Ayn Rand, The Virtue of Selfishness: A New Con...",,11,True
424148,Strength is not created by adversity it is mer...,"Mark Eddy Smith, Tolkien's Ordinary Virtues: E...",,16,True


### Make quotes_clean file smaller!

In [41]:
# import re
# output_df['quote_stripped'] = output_df['quote'].apply(lambda x: re.sub('[^a-z]', '', x.lower()))
# output_df.drop_duplicates(subset='quote_stripped', keep='first', inplace=True)
# output_df.drop(columns=['quote_stripped'], inplace=True)
# output_df
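
The commented-out cell above deduplicates on a normalised copy of the quote text. Here is a minimal runnable sketch of that idea on made-up data; `toy_df` is hypothetical, and the key is kept in a separate Series rather than a temporary column.

```python
import re
import pandas as pd

# Hypothetical frame: the first two quotes differ only in case and punctuation
toy_df = pd.DataFrame({
    'quote': ['The future is now.', 'the future is NOW!', 'Stay curious.'],
    'author': ['Nam June Paik', 'Unknown', 'Unknown'],
})

# Normalised key: lowercase, then keep only the letters a-z
key = toy_df['quote'].apply(lambda q: re.sub('[^a-z]', '', q.lower()))

# Keep the first row for each key
deduped = toy_df[~key.duplicated(keep='first')]
print(deduped)
```

One caveat of this key: a quote with no a-z letters at all (for example, one written entirely in a non-Latin script) reduces to an empty string, so all such quotes would collapse into a single row.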


In [42]:
# output_df = output_df[output_df['author'].apply(lambda x: len(x.split()) <= 5)]
# output_df

In [43]:
# # Sample rows to get under 100 MB for GitHub
# final_output_df = output_df.sample(n=450000, random_state=1)

# final_output_df.to_csv('quotes_clean.csv', index=False)
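
Instead of hard-coding `n=450000`, the sample fraction can be derived from the size of the CSV. The following is a rough sketch under stated assumptions: `SIZE_LIMIT_BYTES`, `csv_size_bytes` and the toy frame are hypothetical, and the 10% safety margin is an arbitrary choice because rows differ in length.

```python
import pandas as pd

SIZE_LIMIT_BYTES = 1000  # hypothetical limit; the real target would be ~100 MB

# Hypothetical frame standing in for output_df
toy_df = pd.DataFrame({
    'quote': [f'quote number {i}' for i in range(100)],
    'author': ['Unknown'] * 100,
})

def csv_size_bytes(df: pd.DataFrame) -> int:
    """Size of the frame when serialised as UTF-8 CSV without the index."""
    return len(df.to_csv(index=False).encode('utf-8'))

full_size = csv_size_bytes(toy_df)

# Fraction of rows that should fit, with a 10% safety margin
fraction = min(1.0, SIZE_LIMIT_BYTES / full_size) * 0.9
sampled = toy_df.sample(frac=fraction, random_state=1)

print(full_size, len(sampled), csv_size_bytes(sampled))
```

For a frame the size of the full dataset, serialising to a string just to measure it is wasteful; checking the size of the written file on disk would be the more practical route.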