# Dataset Cleaning
*(This notebook was inspired by Anton T. Ruberts' Dataset Cleaning notebook.)*
This dataset consists of survey responses that I collected myself.

Since the table is not very wide, I cleaned up the columns in Excel.

The main objectives of this notebook are:
- observe the contents of the dataset,
- melt the character and movie columns into a single column,
- separate entries that use the proper `//` separator from those with improper formatting,
- remove empty entries,
- export the cleaned data.

In [1]:
import pandas as pd
import numpy as np

## Loading the dataset

In [2]:
# Load the raw survey responses
df = pd.read_csv('../data/Data 1 - Raw Survey Responses')
# head(-10) returns every row except the last 10
df.head(-10)

Unnamed: 0,timestamp,username,consent,age_confirmation,mbti,mbti_test,char_movie1,char_movie2,char_movie3,char_movie4,char_movie5,char_movie6,char_movie7,char_movie8,char_movie9,char_movie10,comments
0,4/29/2024 15:45:45,deadpool123,I agree.,"Yes, I am at least 18 years old.",ENFP,michaelcaloz.com,Tony Stark//Avengers: Endgame,Captain America//Avengers: Endgame,Miles Morales//Spider-Man: Across the Spider-V...,Hobie Brown//Spider-Man: Across the Spider-Verse,Peter Parker//Spider-Man: Far From Home,Bella Baxter//Poor Things,Jesse Wallace//Before Sunrise,Celine//Before Sunrise,Peter Parker//The Amazing Spider-Man,Rocket Raccoon//Guardians of the Galaxy Vol. 3,
1,4/29/2024 19:59:03,GBDymrBKBcsRMqef9Nyx#Kn#3LiihcF7#&ghPx!M,I agree.,"Yes, I am at least 18 years old.",INTP,16Personalities,Puss in Boots//Puss in Boots: The Last Wish,J. Robert Oppenheimer//Oppenheimer,Johnny English//Johnny English,Parzival (Wade Watts)//Ready Player One,Benoit Blanc//A Knives Out,Dr. Stephen Strange//Doctor Strange,Neo//The Matrix,"Captain Pete ""Maverick"" Mitchell//Top Gun: Mav...",Tony Stark//Avengers: Infinity War,Loki//Avengers: Infinity War,
2,4/30/2024 4:57:43,Lady-Orpheus,I agree.,"Yes, I am at least 18 years old.",INFP,Truity,Waymond Wang//Everything Everywhere All at Once,Ellen Ripley//Alien,Wall-E//WALL.E (2008),Mathilda//Mathilda,V//V for Vendetta,Morticia//The Addams Family,Amelie Poulain//Amelie,Willy Wonka//Willy Wonka & the Chocolate Facto...,Remus Lupin//Harry Potter,,I'd love to know more about how this data is g...
3,4/30/2024 8:07:45,Hiii,I agree.,"Yes, I am at least 18 years old.",ENFJ,Sakinorva,Grace Le domas//Ready or Not,Jack//Titanic,Wanda//Avengers: Endgame,Joy//Inside Out,,,,,,,
4,4/30/2024 8:24:01,azry705E2,I agree.,"Yes, I am at least 18 years old.",INTP,Cognitive functions self-typing,Yorichii // Demon Slayer,Amy// Gone Girl,Naomi Misora// Death Note,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
109,05/06/2024 21:30,X,I agree.,"Yes, I am at least 18 years old.",ESTJ,16Personalities,Hannibal Lecter//The Silence of the Lambs,Loki//The Avengers,Joker//The Dark Knight (2008),Batman//The Dark Knight (2008),Tony Stark//The Avengers,Dr. Frank-N-Furter//The Rocky Horror Picture Show,Riff Raff//The Rocky Horror Picture Show,Magenta//The Rocky Horror Picture Show,Scar//The Lion King,Hades//Hercules,MBTI is less accurate/useful than D&D alignmen...
110,05/06/2024 21:55,_Nathaniel_,I agree.,"Yes, I am at least 18 years old.",ENTJ,16Personalities,Tony Stark//Iron Man 3,J. Robert Oppenheimer//Oppenheimer,Noah Calhoun//The Notebook,Mitsuha and Taki//Your Name,,,,,,,
111,05/07/2024 11:12,prettybitch23,I agree.,"Yes, I am at least 18 years old.",ENTP,Cognitive functions self-typing,Harley Quinn//birds of prey,Olive Pendergast//Easy A,Bunny//Yeh Jawaani Hai Deewani,Captain America//Captain America,Percy Jackson//Percy Jackson And The Lightning...,Batman//Batman (2022),,,,,nopee
112,05/07/2024 11:49,8100Puppy,I agree.,"Yes, I am at least 18 years old.",ENFJ,16Personalities,Maria Von Trapp//The Sound of Music,Westley//The Princess Bride,Mikey//Goonies,Indiana Jones//Raiders of the Lost Ark,Princess Leia Organa//Star Wars,Samwise Gamgee//Lord of The Rings,Caractacus Potts//Chitty Chitty Bang Bang,Dorothy//The Wizard of Oz,Inigo Montoya//The Princess Bride,Professor MacGonagall//Harry Potter and the So...,


## Initial exploration of the data

In [3]:
print("Dataset shape:", df.shape)
print("Dataset columns:", df.columns)

Dataset shape: (124, 17)
Dataset columns: Index(['timestamp', 'username', 'consent', 'age_confirmation', 'mbti',
       'mbti_test', 'char_movie1', 'char_movie2', 'char_movie3', 'char_movie4',
       'char_movie5', 'char_movie6', 'char_movie7', 'char_movie8',
       'char_movie9', 'char_movie10', 'comments'],
      dtype='object')
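
Beyond shape and column names, it can help to count how many answers are missing, since respondents could fill in fewer than ten characters. The following is a minimal sketch on a made-up toy frame, not the survey data itself; the names `toy`, `missing_per_column`, and `answers_per_row` are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the survey table (values are made up for illustration)
toy = pd.DataFrame({
    "mbti": ["ENFP", "INTP", "INFP"],
    "char_movie1": ["A//X", "B//Y", "C//Z"],
    "char_movie2": ["D//X", np.nan, "E//Z"],
    "char_movie3": [np.nan, np.nan, "F//Z"],
})

# Number of empty cells in each column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Number of character slots each respondent actually filled in
answer_cols = [c for c in toy.columns if c.startswith("char_movie")]
answers_per_row = toy[answer_cols].notna().sum(axis=1)
print(answers_per_row.tolist())
```

On the real data, the same two calls applied to `df` would show which `char_movie` slots are sparsely filled.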


**Observations**
- Some columns are not needed for the analysis.

**Action(s)**
- Drop the `timestamp`, `username`, `consent`, `age_confirmation`, and `comments` columns.
- Insert the index as an `id` column.

In [4]:
# Drop columns that are not needed for the analysis
df_dropped = df.drop(['timestamp', 'username', 'consent', 'age_confirmation', 'comments'], axis = 1)

# Add an id column before melting: turn the index into a column named `id`,
# and rename `mbti` to `person_mbti`
df_dropped.reset_index(inplace=True)
df_dropped.rename(columns = {'index':'id', 'mbti':'person_mbti'}, inplace = True)
df_dropped

Unnamed: 0,id,person_mbti,mbti_test,char_movie1,char_movie2,char_movie3,char_movie4,char_movie5,char_movie6,char_movie7,char_movie8,char_movie9,char_movie10
0,0,ENFP,michaelcaloz.com,Tony Stark//Avengers: Endgame,Captain America//Avengers: Endgame,Miles Morales//Spider-Man: Across the Spider-V...,Hobie Brown//Spider-Man: Across the Spider-Verse,Peter Parker//Spider-Man: Far From Home,Bella Baxter//Poor Things,Jesse Wallace//Before Sunrise,Celine//Before Sunrise,Peter Parker//The Amazing Spider-Man,Rocket Raccoon//Guardians of the Galaxy Vol. 3
1,1,INTP,16Personalities,Puss in Boots//Puss in Boots: The Last Wish,J. Robert Oppenheimer//Oppenheimer,Johnny English//Johnny English,Parzival (Wade Watts)//Ready Player One,Benoit Blanc//A Knives Out,Dr. Stephen Strange//Doctor Strange,Neo//The Matrix,"Captain Pete ""Maverick"" Mitchell//Top Gun: Mav...",Tony Stark//Avengers: Infinity War,Loki//Avengers: Infinity War
2,2,INFP,Truity,Waymond Wang//Everything Everywhere All at Once,Ellen Ripley//Alien,Wall-E//WALL.E (2008),Mathilda//Mathilda,V//V for Vendetta,Morticia//The Addams Family,Amelie Poulain//Amelie,Willy Wonka//Willy Wonka & the Chocolate Facto...,Remus Lupin//Harry Potter,
3,3,ENFJ,Sakinorva,Grace Le domas//Ready or Not,Jack//Titanic,Wanda//Avengers: Endgame,Joy//Inside Out,,,,,,
4,4,INTP,Cognitive functions self-typing,Yorichii // Demon Slayer,Amy// Gone Girl,Naomi Misora// Death Note,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...
119,119,ISFP,16Personalities,Ip Man//Ip Man (2008),Tin Tin//The Adventures Of Tin Tin,Dolores//Encanto,Captain America//Captain America,Gandalf//Lord Of The Rings,,,,,
120,120,ISTP,16Personalities,Katniss Everdeen//The Hunger Games (2012),Tony Stark//Avengers,Colt Seavers//The Fall Guy,The Big Friendly Giant//The BFG,Peter Quill//Guardians of the Galaxy,,,,,
121,121,ESFJ,16Personalities,Jack Sparrow//Pirates of the Caribbean,Neville Longbottom//Harry Potter,Joker//The Dark Knight (2008),Tony Stark// Iron Man,,,,,,
122,122,ESTJ,Cognitive functions self-typing,Severus Snape // Harry Potter,Yoda // star wars,Kiri // avatar 2,Terminator (Arnold Schwarzenegger) // terminat...,,,,,,


## Melt the character columns

In [5]:
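# Reshape from wide to long: one row per (respondent, answer slot)
df_melt = pd.melt(df_dropped, id_vars=['id', 'person_mbti', 'mbti_test'], var_name='char_entry', value_name='char_movie')
df_melt.drop(['char_entry'], axis=1, inplace=True)
# Drop rows with any missing value, which removes unused answer slots.
# (The original sort_values call here discarded its result; sorting by id is done after the split.)
df_melt.dropna(inplace=True)
df_melt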


Unnamed: 0,id,person_mbti,mbti_test,char_movie
0,0,ENFP,michaelcaloz.com,Tony Stark//Avengers: Endgame
1,1,INTP,16Personalities,Puss in Boots//Puss in Boots: The Last Wish
2,2,INFP,Truity,Waymond Wang//Everything Everywhere All at Once
3,3,ENFJ,Sakinorva,Grace Le domas//Ready or Not
4,4,INTP,Cognitive functions self-typing,Yorichii // Demon Slayer
...,...,...,...,...
1222,106,ENTJ,16Personalities,Eliza Doolittle//My Fair Lady
1223,107,INFP,Sakinorva,Batman// The Batman
1225,109,ESTJ,16Personalities,Hades//Hercules
1228,112,ENFJ,16Personalities,Professor MacGonagall//Harry Potter and the So...
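
To make the reshaping concrete, here is a minimal sketch of the same melt-and-drop pattern on a two-respondent toy table. The frame `wide` and the name `long_df` are made up for illustration; only the column naming follows the notebook.

```python
import pandas as pd

# Toy wide table: one row per respondent, one column per answer slot
wide = pd.DataFrame({
    "id": [0, 1],
    "person_mbti": ["ENFP", "INTP"],
    "char_movie1": ["Tony Stark//Avengers: Endgame", "Neo//The Matrix"],
    "char_movie2": ["Joy//Inside Out", None],
})

# Melt: every char_movie* column becomes rows under a single value column
long_df = pd.melt(
    wide,
    id_vars=["id", "person_mbti"],
    var_name="char_entry",
    value_name="char_movie",
)

# The slot name is no longer needed, and empty slots are removed
long_df = long_df.drop(columns="char_entry").dropna(subset=["char_movie"])
print(long_df)
```

The melted rows are grouped by source column rather than by respondent, which is why the `id` values in the output above cycle instead of appearing in order.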


## Split data


In [6]:
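# Work on a copy so that df_melt itself is left unchanged
df_movie_char_split = df_melt.copy()
# Split on the first '//' only: the left part is the character, the right part the movie
new_info = df_movie_char_split['char_movie'].str.split('//', n=1, expand=True)
df_movie_char_split['character'] = new_info[0]
df_movie_char_split['movie'] = new_info[1]
df_movie_char_split.drop(columns=['char_movie'], inplace=True)
# Group each respondent's entries together
df_movie_char_split.sort_values('id', inplace=True)
print(df_movie_char_split)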

       id person_mbti         mbti_test         character  \
0       0        ENFP  michaelcaloz.com        Tony Stark   
1116    0        ENFP  michaelcaloz.com    Rocket Raccoon   
124     0        ENFP  michaelcaloz.com   Captain America   
496     0        ENFP  michaelcaloz.com      Peter Parker   
372     0        ENFP  michaelcaloz.com       Hobie Brown   
...   ...         ...               ...               ...   
743   123        ENFP  michaelcaloz.com       Agent Smith   
247   123        ENFP  michaelcaloz.com  Daniel Plainview   
371   123        ENFP  michaelcaloz.com   Eraserhead Baby   
867   123        ENFP  michaelcaloz.com   Patrick Bateman   
495   123        ENFP  michaelcaloz.com          The Duke   

                                    movie  
0                       Avengers: Endgame  
1116       Guardians of the Galaxy Vol. 3  
124                     Avengers: Endgame  
496             Spider-Man: Far From Home  
372   Spider-Man: Across the Spider-Verse  
...
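
Some raw entries put spaces around the separator (for example `Yorichii // Demon Slayer`), and an entry could lack `//` altogether. The sketch below shows one possible way to set malformed entries aside and trim whitespace after splitting. It is not part of the original notebook; the last toy entry and the names `entries`, `has_sep`, `well_formed`, and `malformed` are hypothetical.

```python
import pandas as pd

entries = pd.DataFrame({
    "char_movie": [
        "Tony Stark//Avengers: Endgame",
        "Yorichii // Demon Slayer",
        "Amy// Gone Girl",
        "Batman - The Dark Knight",  # hypothetical entry without the separator
    ]
})

# Flag entries that contain the literal '//' separator
has_sep = entries["char_movie"].str.contains("//", regex=False)
well_formed = entries[has_sep].copy()
malformed = entries[~has_sep].copy()

# Split on the first '//' and trim the spaces left around each part
parts = well_formed["char_movie"].str.split("//", n=1, expand=True)
well_formed["character"] = parts[0].str.strip()
well_formed["movie"] = parts[1].str.strip()

print(well_formed[["character", "movie"]])
print(malformed)
```

Keeping `malformed` as its own frame makes it possible to review or correct those rows by hand instead of silently ending up with a missing `movie` value.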

In [7]:
df_movie_char_split.to_csv('../data/Data 2 - Transformed Survey Responses')
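
Note that `to_csv` writes the DataFrame index as an unnamed first column by default, and that column comes back as `Unnamed: 0` when the file is read again. The sketch below illustrates the difference using in-memory buffers instead of files; the frame `cleaned` is made-up example data, and whether to pass `index=False` depends on how the next notebook reads the file.

```python
import io

import pandas as pd

# Toy cleaned frame with a non-contiguous index, as left over after melting
cleaned = pd.DataFrame(
    {"id": [0, 0], "character": ["Tony Stark", "Joy"]},
    index=[0, 124],
)

# Default behaviour: the index is written as an unnamed first column
buf_with_index = io.StringIO()
cleaned.to_csv(buf_with_index)
buf_with_index.seek(0)
cols_with_index = pd.read_csv(buf_with_index).columns.tolist()

# With index=False only the data columns are written
buf_without_index = io.StringIO()
cleaned.to_csv(buf_without_index, index=False)
buf_without_index.seek(0)
cols_without_index = pd.read_csv(buf_without_index).columns.tolist()

print(cols_with_index)
print(cols_without_index)
```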