# **Making Sense of K-Pop Using Machine Learning | Part 1 - Data Collection & Cleaning**

Author: Jaemin Lee (aka. Import Data)

Video version with explanation: [here](https://www.youtube.com/channel/UCYDacpfRrCX6_8oDDlzTgFw)

# Import Basic Libraries & Connect to my Gdrive

In [0]:
import pandas as pd
import numpy as np
import sklearn

# Import Data

In [0]:
df = pd.read_excel("/content/KPOP DATA.xlsx")

# Data Cleaning



In [0]:
# function to remove the leading and trailing white space in the data frame
def trim(dataset):
  # use .strip() to remove the leading and trailing white space in each string cell
  strip_if_str = lambda x: x.strip() if isinstance(x, str) else x
  return dataset.applymap(strip_if_str)

df = trim(df)
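
Recent pandas releases (2.1 and later) flag `DataFrame.applymap` as deprecated, so the same trimming can also be done column by column with `Series.map`. This is a minimal sketch under that assumption; the helper name `trim_strings` and the two-row `demo` frame are made up for illustration and are not part of the survey data.

```python
import pandas as pd

def trim_strings(dataset):
    # copy so the caller's frame is left untouched
    cleaned = dataset.copy()
    for col in cleaned.columns:
        # strip only str cells; numbers, timestamps and NaN pass through unchanged
        cleaned[col] = cleaned[col].map(lambda x: x.strip() if isinstance(x, str) else x)
    return cleaned

# made-up sample, not the survey data
demo = pd.DataFrame({"fav_grp": ["  BTS ", "EXO"], "hours": [1, 2]})
trimmed = trim_strings(demo)
print(trimmed["fav_grp"].tolist())
```

Only cell values are stripped here, exactly as in `trim`; column names keep any trailing spaces they had in the spreadsheet.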

In [0]:
# remove timestamp col using index
df.drop(df.columns[[0]], axis = 1, inplace = True)

In [5]:
df.head()

Unnamed: 0,Which is your favourite K-Pop group?,Is K-Pop popular in your country?,Why do you like K-Pop?,When did you start listening to K-Pop?,Do you listen to K-Pop male groups more than girl group songs?,How many hours do you spend listening to K-Pop?,Do you watch K-Pop Youtube videos?,"If you do watch K-Pop music videos, how long do you spend watching them?",Would you say that you are obsessed with K-Pop?,What do you do to keep up with K-Pop news?,Has K-Pop led you to pursue any of the following?,Has consuming K-Pop taken away most of your time?,Has consuming K-Pop affected your personal life in the following ways?,Have you experienced any positive effects after consuming K-Pop?,"On an average, in one year, how much do you spend on K-Pop merchandise?","If you do spend money on K-Pop merchandise, where do you get money from?",Do you attend K-Pop concerts?,What is the craziest thing you have done in your pursuit and love for K-Pop?,How old are you?,Which country are you from?,What is your profession?,What is your gender?
0,BTS,Its gaining popularity,Its different from the usual music. I like som...,1 -2 years ago,"Its about the music, not the gender",5 or more,Yes,2-3 hours,Not sure,Subscribe to K-Pop news sites;Subscribe to You...,Learning about Korean/Asian culture;Pursue dan...,No,Led to you being cajoled (made fun of) by clas...,Reduced stress/anxiety/depression,50 - 100 $,I have a full - time job,Yes,Missed days of work and school to attend concerts,15 - 20 years,A country in the UK,Sales,Female
1,BTS,Its gaining popularity,Its different from the usual music. I like som...,3-4 years ago,I listen to both,2-4 hours,Yes,2-3 hours,Yes,Subscribe to K-Pop news sites;Subscribe to You...,Learning music - either an instrument or singi...,Yes,Led to you being cajoled (made fun of) by clas...,Made friends who are like-minded,Below 50$,Working part-time,"I want to, but the K-Pop groups don't come to ...",Listened to K-Pop without studying or working ...,15 - 20 years,A country in the UK,Student,Female
2,BTS;MONSTA X,Its gaining popularity,Its different from the usual music. I like som...,More than 4 years ago,"Its about the music, not the gender",2-4 hours,Yes,2-3 hours,No,Subscribe to K-Pop news sites;Join K-pop socia...,Learning about Korean/Asian culture;Learn abou...,No,Led to you being cajoled (made fun of) by clas...,Made friends who are like-minded,50 - 100 $,Working part-time,Yes,Nothing as such,15 - 20 years,Other European countries,Student,Female
3,BTS,Its gaining popularity,Its different from the usual music. I like som...,3-4 years ago,I listen to both,5 or more,Yes,3-4 hours,No,Subscribe to K-Pop news sites;Subscribe to You...,Learning about Korean/Asian culture;Visit Korea,No,Led to you being cajoled (made fun of) by clas...,Reduced stress/anxiety/depression,50 - 100 $,Working part-time,Yes,Nothing as such,21 - 26 years,USA,Student,Female
4,BTS,Yes,Its different from the usual music. I like som...,3-4 years ago,Yes,5 or more,Yes,2-3 hours,Not sure,Subscribe to Youtube channels,Learning about Korean/Asian culture,Yes,Led to you being cajoled (made fun of) by clas...,Reduced stress/anxiety/depression,100 - 200$,Parents,"I want to, but the K-Pop groups don't come to ...",Nothing as such,15 - 20 years,USA,Student,Female


In [0]:
# rename column names as they are long
df = df.rename(columns = {'Which is your favourite K-Pop group?': 'fav_grp',
                          'Is K-Pop popular in your country? ': 'popl_by_co_yn',
                          'Why do you like K-Pop?': 'reason',
                          'When did you start listening to K-Pop?': 'yr_listened',
                          'Do you listen to K-Pop male groups more than girl group songs?': 'gender_pref',
                          'How many hours do you spend listening to K-Pop?': 'daily_music_hr',
                          'Do you watch K-Pop Youtube videos?': 'watch_MV_yn',
                          'If you do watch K-Pop music videos, how long do you spend watching them?': 'daily_MV_hr',
                          'Would you say that you are obsessed with K-Pop?': 'obsessed_yn',
                          'What do you do to keep up with K-Pop news?': 'news_medium',
                          'Has K-Pop led you to pursue any of the following?': 'pursuit',
                          'Has consuming K-Pop taken away most of your time?': 'time_cons_yn',
                          'Has consuming K-Pop affected your personal life in the following ways?': 'life_chg',
                          'Have you experienced any positive effects after consuming K-Pop? ': 'pos_eff',
                          'On an average, in one year, how much do you spend on K-Pop merchandise?': 'yr_merch_spent',
                          'If you do spend money on K-Pop merchandise, where do you get money from?': 'money_src',
                          'Do you attend K-Pop concerts?': 'concert_yn',
                          'What is the craziest thing you have done in your pursuit and love for K-Pop?': 'crazy_ev',
                          'How old are you?': 'age',
                          'Which country are you from?': 'country',
                          'What is your profession?': 'job',
                          'What is your gender?': 'gender'
                          })

In [7]:
# check for null values
df.isnull().sum()

fav_grp           0
popl_by_co_yn     0
reason            0
yr_listened       0
gender_pref       0
daily_music_hr    0
watch_MV_yn       0
daily_MV_hr       5
obsessed_yn       0
news_medium       0
pursuit           0
time_cons_yn      0
life_chg          1
pos_eff           0
yr_merch_spent    0
money_src         1
concert_yn        0
crazy_ev          0
age               0
country           0
job               0
gender            0
dtype: int64

Replace the null value in the "life_chg" column

In [8]:
# simply replace it with "none"
df["life_chg"] = df["life_chg"].fillna("none")
df.isnull().sum()

fav_grp           0
popl_by_co_yn     0
reason            0
yr_listened       0
gender_pref       0
daily_music_hr    0
watch_MV_yn       0
daily_MV_hr       5
obsessed_yn       0
news_medium       0
pursuit           0
time_cons_yn      0
life_chg          0
pos_eff           0
yr_merch_spent    0
money_src         1
concert_yn        0
crazy_ev          0
age               0
country           0
job               0
gender            0
dtype: int64

Replace the null value in the "money_src" column

In [9]:
# simply replace it with "none"
df["money_src"] = df["money_src"].fillna("none")
df.isnull().sum()

fav_grp           0
popl_by_co_yn     0
reason            0
yr_listened       0
gender_pref       0
daily_music_hr    0
watch_MV_yn       0
daily_MV_hr       5
obsessed_yn       0
news_medium       0
pursuit           0
time_cons_yn      0
life_chg          0
pos_eff           0
yr_merch_spent    0
money_src         0
concert_yn        0
crazy_ev          0
age               0
country           0
job               0
gender            0
dtype: int64
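
The two single-column fills above could also be done in one call, since `DataFrame.fillna` accepts a dictionary that maps column names to fill values. The sketch below uses a made-up three-column frame (`demo`) to show that columns not listed in the dictionary keep their nulls.

```python
import numpy as np
import pandas as pd

# made-up sample with one gap in each column
demo = pd.DataFrame({
    "life_chg": ["Nope", np.nan],
    "money_src": [np.nan, "Parents"],
    "daily_MV_hr": ["1 hour", np.nan],
})

# only the listed columns are filled; daily_MV_hr keeps its NaN
filled = demo.fillna({"life_chg": "none", "money_src": "none"})
print(filled.isnull().sum())
```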

Cleaning the null values in the "daily_MV_hr" column


In [10]:
# get a sense of what the values look like
df["daily_MV_hr"].value_counts()

1 hour                  118
2-3 hours                84
3-4 hours                20
More than four hours     13
Name: daily_MV_hr, dtype: int64

In [11]:
# replace the nulls with "2.5" as a rough average
# (kept as a string for now so it can be parsed like the other answers)
df["daily_MV_hr"] = df["daily_MV_hr"].fillna("2.5")
df.isnull().sum()

fav_grp           0
popl_by_co_yn     0
reason            0
yr_listened       0
gender_pref       0
daily_music_hr    0
watch_MV_yn       0
daily_MV_hr       0
obsessed_yn       0
news_medium       0
pursuit           0
time_cons_yn      0
life_chg          0
pos_eff           0
yr_merch_spent    0
money_src         0
concert_yn        0
crazy_ev          0
age               0
country           0
job               0
gender            0
dtype: int64

In [12]:
# convert "More than four hours" to 4.5
daily_mv = df["daily_MV_hr"]
daily_mv = daily_mv.str.replace('More than four hours', '4.5')
daily_mv.value_counts()

1 hour       118
2-3 hours     84
3-4 hours     20
4.5           13
2.5            5
Name: daily_MV_hr, dtype: int64

In [13]:
# parse the hours
daily_mv = daily_mv.apply(lambda x: x.split(' ')[0]) # look for the whitespace, split it and get the first element
daily_mv.value_counts()

1      118
2-3     84
3-4     20
4.5     13
2.5      5
Name: daily_MV_hr, dtype: int64

In [0]:
# function to find the mean when some values are ranges and others are not
def split_mean(x):
  # split before and after the hyphen (-)
  split_num = x.split("-")
  if len(split_num) == 2:
    return (float(split_num[0]) + float(split_num[1])) / 2
  # values that are not a range
  else:
    return float(x)

# apply the split_mean function to the "daily_MV_hr" column
daily_mv = daily_mv.apply(split_mean)

In [15]:
# overwrite it to the original dataset
df["daily_MV_hr"] = daily_mv

df["daily_MV_hr"].value_counts()

1.0    118
2.5     89
3.5     20
4.5     13
Name: daily_MV_hr, dtype: int64
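
The replace, split and `split_mean` steps above can be folded into one helper. This is only a sketch: `hours_to_float` and its `open_ended` parameter are hypothetical names, and it assumes every answer other than "More than four hours" contains at least one number.

```python
import re
import pandas as pd

def hours_to_float(text, open_ended=4.5):
    # open_ended is the value used for "More than four hours" (4.5 in the notebook)
    if text.lower().startswith("more than"):
        return open_ended
    # collect every number in the answer and average them: "2-3 hours" -> 2.5
    nums = [float(n) for n in re.findall(r"\d+(?:\.\d+)?", text)]
    return sum(nums) / len(nums)

# the five distinct values seen in the column after the nulls were filled
demo = pd.Series(["1 hour", "2-3 hours", "3-4 hours", "More than four hours", "2.5"])
parsed = demo.map(hours_to_float)
print(parsed.tolist())
```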

Cleaning the "yr_listened" column

In [16]:
df["yr_listened"].value_counts()

3-4 years ago            89
More than 4 years ago    69
1 -2 years ago           66
Less than a year ago      7
7 years ago

In [17]:
# replace certain strings
df["yr_listened"] = df["yr_listened"].str.replace("3-4 years ago", "3.5")
df["yr_listened"] = df["yr_listened"].str.replace("1 -2 years ago", "1.5")
df["yr_listened"] = df["yr_listened"].str.replace("More than 4 years ago", "4")
df["yr_listened"] = df["yr_listened"].str.replace("Less than a year ago", "1")
df["yr_listened"] = df["yr_listened"].str.replace("Iâ€™ve been listening to it all my life basically, my cousins are Asian, and Iâ€™m mexican, so people think Iâ€™m just a koreaboo, when in reality Iâ€™ve only ever listened to korean and Chinese music", "6")
df["yr_listened"] = df["yr_listened"].str.replace("About 6 years ago, I got introduced to 2ne1 which was their song called â€œI am the bestâ€. So sad they disbanded", "6")
df["yr_listened"] = df["yr_listened"].str.replace("Started in 2006", "14")

df["yr_listened"].value_counts()

3.5                     89
4                       69
1.5                     66
1                        7
7 years ago              2
6                        2
7 years                  1
9 years coming April     1
9 years, since 2010      1
14                       1
8+ years ago             1
Name: yr_listened, dtype: int64

In [18]:
# parse the years
yrs_listen = df["yr_listened"].apply(lambda x: x.split(' ')[0]) # look for the whitespace, split it and get the first element
yrs_listen.value_counts()

3.5    89
4      69
1.5    66
1       7
7       3
6       2
9       2
8+      1
14      1
Name: yr_listened, dtype: int64

In [19]:
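# remove the + sign
yrs_listen = yrs_listen.apply(lambda x: x.replace('+', ""))
# write the parsed values back to the data frame as floats
df["yr_listened"] = yrs_listen.astype(float)
yrs_listen.value_counts()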

3.5    89
4      69
1.5    66
1       7
7       3
6       2
9       2
14      1
8       1
Name: yr_listened, dtype: int64
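
Once the long free-text answers have been replaced, the "take the first token, drop the +" steps can be approximated with a single regular expression. The sketch below assumes each remaining answer starts with, or at least contains, a number; the `demo` values are copied from the outputs above.

```python
import pandas as pd

demo = pd.Series(["3.5", "4", "7 years ago", "9 years, since 2010", "8+ years ago", "14"])
# grab the first integer or decimal in each answer, then convert to float
years = demo.str.extract(r"(\d+(?:\.\d+)?)", expand=False).astype(float)
print(years.tolist())
```

Answers without any digit would come back as `NaN`, which makes them easy to find with `isnull()` before deciding how to label them.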

Cleaning the "daily_music_hr" column

In [20]:
df["daily_music_hr"].value_counts()

2-4 hours            106
5 or more             99
Less than an hour

In [21]:
# convert the timestamp (2020-07-24 00:00:00) to 5 or more
df.iloc[156, 5] = "5 or more"
df.iloc[170, 5] = "5 or more"

df["daily_music_hr"].value_counts()

2-4 hours            106
5 or more            101
Less than an hour

In [22]:
# parse the hour
df["daily_music_hr"] = df["daily_music_hr"].apply(lambda x: x.split(" ")[0])
df["daily_music_hr"].value_counts()

2-4          106
5            101
Less          11
Everyday       2
All            2
I              2
The            1
Over           1
Everyday,      1
Almost         1
depends        1
Hours          1
It's           1
Anytime        1
Most           1
Any            1
Every          1
Eh             1
Nearly         1
Itâ€™s         1
Idk            1
Whenever       1
Name: daily_music_hr, dtype: int64

In [23]:
# get the location of the "daily_music_hr" column for the for loop
df.columns.get_loc("daily_music_hr")

5

In [24]:
# write a for loop to clean the data
for row in range(0, len(df["daily_music_hr"])):
  if '2-4' in df.iloc[row, 5]:
    # mean of 2+4
    df.iloc[row, 5] = "3"
  elif '5' in df.iloc[row, 5]:
    df.iloc[row, 5] = "5"
  elif 'Less' in df.iloc[row, 5]:
    df.iloc[row, 5] = "0.5"
  else:
    df.iloc[row, 5] = "10"

df["daily_music_hr"].value_counts()

3      106
5      101
10      22
0.5     11
Name: daily_music_hr, dtype: int64
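
The same if/elif labelling can be written without a row loop or a hard-coded column position. This is a sketch using `numpy.select`, which checks the conditions in order just like the chain above; the four-value `demo` series stands in for the parsed column.

```python
import numpy as np
import pandas as pd

demo = pd.Series(["2-4", "5", "Less", "Everyday"])

# conditions are checked in order, like the if/elif chain
conditions = [
    demo.str.contains("2-4", regex=False),
    demo.str.contains("5", regex=False),
    demo.str.contains("Less", regex=False),
]
choices = ["3", "5", "0.5"]

# anything that matches none of the conditions gets "10"
labels = pd.Series(np.select(conditions, choices, default="10"), index=demo.index)
print(labels.tolist())
```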

Clean the "yr_merch_spent" column

In [25]:
df["yr_merch_spent"].value_counts()

Below 50$                                                                                                                                         91
50 - 100 $                                                                                                                                        57
I don't spend anything on merchandise                                                                                                             49
100 - 200$                                                                                                                                        25
A lot I donâ€™t even want to know how much                                                                                                         1
Iâ€™ve spent 200-300 dollars, but i chose not to spend much more unless its clothing or a physical concert. Plus i rlly only ult stan 2 groups     1
$2,000                                                                                                    

In [26]:
# get the location of "yr_merch_spent" for the for loop
df.columns.get_loc("yr_merch_spent")

14

In [27]:
# write a for loop to clean the data
for row in range(0, len(df["yr_merch_spent"])):
  if 'Below 50' in df.iloc[row, 14]:
    df.iloc[row, 14] = "50"
  elif '50 - 100' in df.iloc[row, 14]:
    df.iloc[row, 14] = "75"
  elif "I don't spend anything on merchandise" in df.iloc[row, 14]:
    df.iloc[row, 14] = "0"
  elif '100 - 200' in df.iloc[row, 14]:
    df.iloc[row, 14] = "150"
  elif 'Below 30' in df.iloc[row, 14]:
    df.iloc[row, 14] = "30"
  elif '10?' in df.iloc[row, 14]:
    df.iloc[row, 14] = "10"
  else:
    df.iloc[row, 14] = "500"

df["yr_merch_spent"].value_counts()

50     91
75     57
0      49
150    25
500    16
30      1
10      1
Name: yr_merch_spent, dtype: int64
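
An alternative to the long if/elif chain is to keep the rules in a list and apply them with `Series.map`. The names `MERCH_RULES` and `merch_label` are hypothetical; the substrings and labels are the ones used in the loop above, and the fallback of "500" for unmatched free-text answers is kept as is.

```python
import pandas as pd

# (substring, label) pairs, checked top to bottom like the if/elif chain
MERCH_RULES = [
    ("Below 50", "50"),
    ("50 - 100", "75"),
    ("I don't spend anything on merchandise", "0"),
    ("100 - 200", "150"),
    ("Below 30", "30"),
    ("10?", "10"),
]

def merch_label(text, fallback="500"):
    for needle, label in MERCH_RULES:
        if needle in text:
            return label
    # free-text answers that match no rule fall back to "500", as in the loop
    return fallback

demo = pd.Series(["Below 50$", "50 - 100 $", "100 - 200$", "$2,000"])
merch = demo.map(merch_label)
print(merch.tolist())
```

Keeping the rules in one list makes it easier to see that the order matters and to add a rule without touching the loop.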

Clean the "age" column

In [28]:
df["age"].value_counts()

15 - 20 years    166
21 - 26 years     37
10 - 14 years     34
27 - 30            3
Name: age, dtype: int64

In [29]:
# remove "years"
age = df["age"].apply(lambda x: x.split('y')[0])
age = age.str.replace(" ", "")
age.value_counts()

15-20    166
21-26     37
10-14     34
27-30      3
Name: age, dtype: int64

In [30]:
# apply split mean on age
df["age"] = age.apply(lambda x: split_mean(x))
df["age"].value_counts()

17.5    166
23.5     37
12.0     34
28.5      3
Name: age, dtype: int64

In [31]:
# round up the age
df["age"] = np.ceil(df["age"]).astype(int)
df["age"].value_counts()

18    166
24     37
12     34
29      3
Name: age, dtype: int64
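
The three age steps (strip "years", take the midpoint, round up) can be sketched as one chain with `str.extract`, assuming every answer is a "low - high" range as in the value counts above.

```python
import numpy as np
import pandas as pd

demo = pd.Series(["15 - 20 years", "21 - 26 years", "10 - 14 years", "27 - 30"])
# capture the lower and upper bound of each range as two columns
bounds = demo.str.extract(r"(\d+)\s*-\s*(\d+)").astype(float)
# midpoint of the range, rounded up to a whole year
age_mid = np.ceil(bounds.mean(axis=1)).astype(int)
print(age_mid.tolist())
```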

Clean "fav_grp"

In [32]:
df['fav_grp'].value_counts()

BTS                                                    104
EXO                                                      7
Stray Kids                                               7
MONSTA X                                                 7
BLACKPINK                                                5
                                                      ... 
LOONA                                                    1
Bigbang, GOT7 and all the above                          1
INFINITE                                                 1
BTS;EXO;MONSTA X;Ateez, nct, GOT7 too many to count      1
Vixx and Super Junior                                    1
Name: fav_grp, Length: 93, dtype: int64

In [0]:
# check how the groups are separated:
# ";", "and", ",", "/", "&" and "." all appear as separators
# replace each of them with a plain comma
grp = df['fav_grp']

grp = grp.apply(lambda x: x.lower().replace(";", ",").replace(" and ", ",").replace(", ", ",").replace(" / ", ",").replace(" & ", ",").replace(". ", ","))

In [0]:
# grp[0].count(",")+1

# create a function that returns the number of groups each respondent likes
def num_grp_like(groups):
  tmpArr = []
  for i in range(0, len(groups)):
    # one comma separates two groups, so groups = commas + 1
    num_grp = groups.iloc[i].count(",")+1
    tmpArr.append(num_grp)
  return tmpArr

In [35]:
# append a new column, counting on the comma-separated version of the answers
df['num_gr_like'] = num_grp_like(grp)
df['num_gr_like'].value_counts()

1     165
2      24
3      17
5      12
4      10
6       5
7       3
8       2
37      1
20      1
Name: num_gr_like, dtype: int64
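
The separator clean-up and the comma count can also be expressed with pandas string methods. This sketch is not identical to the chained `.replace()` calls above: it assumes a single regular expression for ";", ",", "/", "&" and the word "and", it also splits on a "/" that has no spaces around it, and it does not treat ". " as a separator.

```python
import pandas as pd

demo = pd.Series(["BTS", "BTS;MONSTA X", "Vixx and Super Junior", "Bigbang, GOT7 and all the above"])
# one pattern for ";", ",", "/", "&" and the word "and", with optional spaces around it
separator = r"\s*(?:[;,/&]|\band\b)\s*"
normalized = demo.str.lower().str.replace(separator, ",", regex=True)
# one comma separates two groups, so groups = commas + 1
num_groups = normalized.str.count(",") + 1
print(normalized.tolist())
print(num_groups.tolist())
```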

In [36]:
df['fav_grp'] = grp
df.head()

Unnamed: 0,fav_grp,popl_by_co_yn,reason,yr_listened,gender_pref,daily_music_hr,watch_MV_yn,daily_MV_hr,obsessed_yn,news_medium,pursuit,time_cons_yn,life_chg,pos_eff,yr_merch_spent,money_src,concert_yn,crazy_ev,age,country,job,gender,num_gr_like
0,bts,Its gaining popularity,Its different from the usual music. I like som...,1.5,"Its about the music, not the gender",5,Yes,2.5,Not sure,Subscribe to K-Pop news sites;Subscribe to You...,Learning about Korean/Asian culture;Pursue dan...,No,Led to you being cajoled (made fun of) by clas...,Reduced stress/anxiety/depression,75,I have a full - time job,Yes,Missed days of work and school to attend concerts,18,A country in the UK,Sales,Female,1
1,bts,Its gaining popularity,Its different from the usual music. I like som...,3.5,I listen to both,3,Yes,2.5,Yes,Subscribe to K-Pop news sites;Subscribe to You...,Learning music - either an instrument or singi...,Yes,Led to you being cajoled (made fun of) by clas...,Made friends who are like-minded,50,Working part-time,"I want to, but the K-Pop groups don't come to ...",Listened to K-Pop without studying or working ...,18,A country in the UK,Student,Female,1
2,"bts,monsta x",Its gaining popularity,Its different from the usual music. I like som...,4.0,"Its about the music, not the gender",3,Yes,2.5,No,Subscribe to K-Pop news sites;Join K-pop socia...,Learning about Korean/Asian culture;Learn abou...,No,Led to you being cajoled (made fun of) by clas...,Made friends who are like-minded,75,Working part-time,Yes,Nothing as such,18,Other European countries,Student,Female,2
3,bts,Its gaining popularity,Its different from the usual music. I like som...,3.5,I listen to both,5,Yes,3.5,No,Subscribe to K-Pop news sites;Subscribe to You...,Learning about Korean/Asian culture;Visit Korea,No,Led to you being cajoled (made fun of) by clas...,Reduced stress/anxiety/depression,75,Working part-time,Yes,Nothing as such,24,USA,Student,Female,1
4,bts,Yes,Its different from the usual music. I like som...,3.5,Yes,5,Yes,2.5,Not sure,Subscribe to Youtube channels,Learning about Korean/Asian culture,Yes,Led to you being cajoled (made fun of) by clas...,Reduced stress/anxiety/depression,150,Parents,"I want to, but the K-Pop groups don't come to ...",Nothing as such,18,USA,Student,Female,1


In [37]:
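# note: this is a reference to df['fav_grp'], not a copy, so the loop below
# also overwrites df['fav_grp'] and triggers the SettingWithCopyWarning in the output
bts_vs_others = df['fav_grp']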

for row in range(0, 240):
  if "bts" in bts_vs_others.iloc[row]:
    bts_vs_others.iloc[row] = "bts"
  else:
    bts_vs_others.iloc[row] = "other(s)"

df["bts_vs_others"] = bts_vs_others

df.bts_vs_others.value_counts()

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._setitem_with_indexer(indexer, value)


bts         163
other(s)     77
Name: bts_vs_others, dtype: int64
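
One way to get the same labels without touching `fav_grp` (and without the warning) is to build the new column directly from a boolean mask. This is a sketch on a made-up four-row frame, not a rerun on the survey data.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({"fav_grp": ["bts", "bts,monsta x", "exo", "loona,got7"]})
# build the label as a new column; fav_grp itself is never written to
has_bts = demo["fav_grp"].str.contains("bts", regex=False)
demo["bts_vs_others"] = np.where(has_bts, "bts", "other(s)")
print(demo)
```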

In [38]:
df.head(10)

Unnamed: 0,fav_grp,popl_by_co_yn,reason,yr_listened,gender_pref,daily_music_hr,watch_MV_yn,daily_MV_hr,obsessed_yn,news_medium,pursuit,time_cons_yn,life_chg,pos_eff,yr_merch_spent,money_src,concert_yn,crazy_ev,age,country,job,gender,num_gr_like,bts_vs_others
0,bts,Its gaining popularity,Its different from the usual music. I like som...,1.5,"Its about the music, not the gender",5,Yes,2.5,Not sure,Subscribe to K-Pop news sites;Subscribe to You...,Learning about Korean/Asian culture;Pursue dan...,No,Led to you being cajoled (made fun of) by clas...,Reduced stress/anxiety/depression,75,I have a full - time job,Yes,Missed days of work and school to attend concerts,18,A country in the UK,Sales,Female,1,bts
1,bts,Its gaining popularity,Its different from the usual music. I like som...,3.5,I listen to both,3,Yes,2.5,Yes,Subscribe to K-Pop news sites;Subscribe to You...,Learning music - either an instrument or singi...,Yes,Led to you being cajoled (made fun of) by clas...,Made friends who are like-minded,50,Working part-time,"I want to, but the K-Pop groups don't come to ...",Listened to K-Pop without studying or working ...,18,A country in the UK,Student,Female,1,bts
2,bts,Its gaining popularity,Its different from the usual music. I like som...,4.0,"Its about the music, not the gender",3,Yes,2.5,No,Subscribe to K-Pop news sites;Join K-pop socia...,Learning about Korean/Asian culture;Learn abou...,No,Led to you being cajoled (made fun of) by clas...,Made friends who are like-minded,75,Working part-time,Yes,Nothing as such,18,Other European countries,Student,Female,2,bts
3,bts,Its gaining popularity,Its different from the usual music. I like som...,3.5,I listen to both,5,Yes,3.5,No,Subscribe to K-Pop news sites;Subscribe to You...,Learning about Korean/Asian culture;Visit Korea,No,Led to you being cajoled (made fun of) by clas...,Reduced stress/anxiety/depression,75,Working part-time,Yes,Nothing as such,24,USA,Student,Female,1,bts
4,bts,Yes,Its different from the usual music. I like som...,3.5,Yes,5,Yes,2.5,Not sure,Subscribe to Youtube channels,Learning about Korean/Asian culture,Yes,Led to you being cajoled (made fun of) by clas...,Reduced stress/anxiety/depression,150,Parents,"I want to, but the K-Pop groups don't come to ...",Nothing as such,18,USA,Student,Female,1,bts
5,bts,Its gaining popularity,Its different from the usual music. I like som...,4.0,Yes,3,Yes,1.0,No,Subscribe to K-Pop news sites,Learning about Korean/Asian culture,No,Led to you being cajoled (made fun of) by clas...,Made friends who are like-minded,50,Working part-time,"I want to, but the K-Pop groups don't come to ...",Constantly avoid going out with friends/family...,18,USA,Student,Female,5,bts
6,bts,I'm not sure,The idols connect with their fans in a way tha...,3.5,I listen to both,5,Yes,1.0,Yes,Subscribe to Youtube channels,Learn about Korean fashion/makeup,Yes,Reduced the amount of sleep you get,Made friends who are like-minded,50,Parents,"I want to, but the K-Pop groups don't come to ...",Nothing as such,12,Canada,Student,Female,1,bts
7,bts,Yes,Its different from the usual music. I like som...,4.0,I listen to both,3,Yes,1.0,Yes,Subscribe to K-Pop news sites;Subscribe to You...,Learning about Korean/Asian culture;Pursue dan...,Not sure,Nothing really,Reduced stress/anxiety/depression,50,Parents,Yes,Nothing as such,18,USA,Student,Female,1,bts
8,other(s),Its gaining popularity,Its different from the usual music. I like som...,3.5,I listen to both,5,Yes,2.5,Not sure,Subscribe to Youtube channels,Learning about Korean/Asian culture;Pursue dan...,Yes,Reduced the amount of sleep you get,Made friends who are like-minded,50,Parents,"I want to, but the K-Pop groups don't come to ...",Listened to K-Pop without studying or working ...,12,A country in the UK,Student,Female,2,other(s)
9,bts,Yes,Its different from the usual music. I like som...,4.0,I listen to both,10,Yes,1.0,No,Subscribe to Youtube channels,Learning about Korean/Asian culture;Learn abou...,No,Nope,Reduced stress/anxiety/depression,0,Working part-time,"I want to, but the K-Pop groups don't come to ...",Nothing as such,18,USA,Student,Female,1,bts


Clean "news_medium"

In [39]:
# check whether anything else needs to be re-labeled
df["news_medium"].value_counts()

Subscribe to Youtube channels                                                                                    54
Join K-pop social media groups                                                                                   38
Subscribe to K-Pop news sites;Subscribe to Youtube channels;Join K-pop social media groups                       37
Subscribe to Youtube channels;Join K-pop social media groups                                                     31
Subscribe to K-Pop news sites                                                                                    17
Subscribe to K-Pop news sites;Subscribe to Youtube channels                                                      15
Subscribe to K-Pop news sites;Join K-pop social media groups                                                      6
Twitter                                                                                                           5
Subscribe to Youtube channels;Join K-pop social media groups;Keep update

In [40]:
# re-label news medium 
# YT, Social media(twitter, instagram), both YT and Social media, others(reddit, tumblr, etc)
news = df['news_medium']

news = news.apply(lambda x: x.lower())

for row in range(0, 240):
  if "youtube" in news.iloc[row]:
    news.iloc[row] = "youtube"
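  # caution: this chained `and` only tests "social media"; "twitter" and "instagram" are
  # non-empty strings (always truthy), so answers such as "Twitter" fall through to "others"
  elif "twitter" and "instagram" and "social media" in news.iloc[row]: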
    news.iloc[row] = "social media (twitter, instagram)"
  else:
    news.iloc[row] = "others (reddit, tumbler, or none)"

news.value_counts()

df['news_medium'] = news
df['news_medium'].value_counts()

youtube                              153
social media (twitter, instagram)     47
others (reddit, tumbler, or none)     40
Name: news_medium, dtype: int64
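
The counts above come from a condition that only checks for "social media". If the intent is to match any of the three keywords, `any(...)` is one way to write it. The helper `label_news` is hypothetical, and using it on the survey data would shift some answers (for example "Twitter") out of the "others" bucket, so the counts would no longer match the output above.

```python
def label_news(text):
    text = text.lower()
    if "youtube" in text:
        return "youtube"
    # any(...) checks each keyword against the text, one at a time
    if any(word in text for word in ("twitter", "instagram", "social media")):
        return "social media (twitter, instagram)"
    return "others (reddit, tumbler, or none)"

# the chained form only evaluates the last membership test
chained = "twitter" and "instagram" and "social media" in "twitter"
print(chained)
print(label_news("Twitter"))
```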

Clean "pursuit"

In [41]:
df["pursuit"].value_counts()

Learning about Korean/Asian culture                                                                                                                                                                                                                  36
Learning about Korean/Asian culture;Learn about Korean fashion/makeup                                                                                                                                                                                23
Learning about Korean/Asian culture;Pursue dancing;Learn about Korean fashion/makeup;Visit Korea                                                                                                                                                     19
Learning music - either an instrument or singing;Learning about Korean/Asian culture;Pursue dancing;Learn about Korean fashion/makeup;Visit Korea                                                                                                    16
Learning

In [0]:
pursue = df['pursuit']
pursue_test = pursue

In [0]:
# re-label pursuit - 
# Learning about Korean/Asian culture
# Learn about Korean fashion/makeup
# Learning music - either an instrument or singing
# pursue dancing
# others

pursue_test = pursue_test.apply(lambda x: x.lower())

for row in range(0, 240):
  if "learning about korean/asian culture" in pursue_test.iloc[row] and ";" not in pursue_test.iloc[row]:
    pursue_test.iloc[row] = "learn the culture"
  elif "learn about korean fashion/makeup" in pursue_test.iloc[row] and ";" not in pursue_test.iloc[row]:
    pursue_test.iloc[row] = "learn korean fashion/makeup"
  elif "learning music - either an instrument or singing" in pursue_test.iloc[row] and ";" not in pursue_test.iloc[row]:
    pursue_test.iloc[row] = "learn music"
  elif "pursue dancing" in pursue_test.iloc[row] and ";" not in pursue_test.iloc[row]:
    pursue_test.iloc[row] = "pursue dancing"
  elif "no" in pursue_test.iloc[row] and ";" not in pursue_test.iloc[row]:
    pursue_test.iloc[row] = "none"
  else:
    pursue_test.iloc[row] = "others (combination of the four and visit korea)"

In [44]:
#pursue_test.value_counts()

df["pursuit"] = pursue_test
df.pursuit.value_counts()

others (combination of the four and visit korea)        182
learn the culture                                        36
pursue dancing                                            6
learn korean fashion/makeup                               6
none                                                      5
learn music                                               5
Name: pursuit, dtype: int64
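
Because 182 of the 240 answers land in the combined "others" bucket, it may be worth keeping the individual selections as well. One possible approach, shown here only as a sketch on made-up answers, is `Series.str.get_dummies`, which turns a ";"-separated multi-select answer into one 0/1 column per option.

```python
import pandas as pd

demo = pd.Series([
    "Learning about Korean/Asian culture;Pursue dancing",
    "Visit Korea",
    "Learning about Korean/Asian culture",
])
# one 0/1 column per distinct option found between the ";" separators
flags = demo.str.get_dummies(sep=";")
print(flags.sum())
```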

Cleaning "time_cons_yn"

In [45]:
df['time_cons_yn'].value_counts()

No                                                                                                                                                                                                                                                     115
Yes                                                                                                                                                                                                                                                     69
Not sure                                                                                                                                                                                                                                                34
It used to but now I’ve learnt to control it

In [0]:
time_cons = df["time_cons_yn"]
time_cons_test = time_cons

In [47]:
# re-label time_cons_yn 
# yes, no, not sure (depends), sometimes, used to, others

time_cons_test = time_cons_test.apply(lambda x: x.lower())

for row in range(0, 240):
  if "yes" in time_cons_test.iloc[row]:
    time_cons_test.iloc[row] = "yes"
  elif "no" in time_cons_test.iloc[row]:
    # this substring test also matches "not sure" and "now", so those answers land here
    time_cons_test.iloc[row] = "no"
  else:
    time_cons_test.iloc[row] = "others (sometimes, not sure)"

time_cons_test.value_counts()

no                              155
yes                              74
others (sometimes, not sure)     11
Name: time_cons_yn, dtype: int64
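
If "not sure" answers should stay separate from "no", one option is to match "yes" and "no" as whole words with a regular expression instead of a plain substring test. The sketch below uses made-up sample answers and a hypothetical helper name, `relabel_yes_no`. It is not what produced the counts shown above.

```python
import re
import pandas as pd

# \b marks a word boundary, so "no" matches but "not" and "now" do not
YES = re.compile(r"\byes\b")
NO = re.compile(r"\bno\b")
OTHER_LABEL = "others (sometimes, not sure)"

def relabel_yes_no(answer):
    """Return 'yes', 'no' or the catch-all label for one raw answer."""
    text = answer.lower()
    if YES.search(text):
        return "yes"
    if NO.search(text):
        return "no"
    return OTHER_LABEL

# made-up answers standing in for df['time_cons_yn']
sample = pd.Series([
    "Yes",
    "No",
    "Not sure",
    "It used to but now I've learnt to control it",
])
result = sample.apply(relabel_yes_no)
print(result.tolist())
```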

In [0]:
# overwrite it to df
df["time_cons_yn"] = time_cons_test

Clean "life_chg"

In [49]:
df.life_chg.value_counts()

Led to you being cajoled (made fun of) by classmates and family                                                                                                                   93
Reduced the amount of sleep you get                                                                                                                                               71
Reduced your time to socialise with your friends/partners/family                                                                                                                  10
Reduced the number of friends you have                                                                                                                                             7
None                                                                                                                                                                               5
None of the above                                                                              

In [0]:
lf_chg = df["life_chg"]
lf_chg_test = lf_chg

In [51]:
# re-label life_chg
# Led to you being cajoled (made fun of) by classmates and family
# Reduced the amount of sleep you get
# Reduced your time to socialise with your friends/partners/family
# none (no, not negative)
# others (combination of the three, small criticism)

lf_chg_test = lf_chg_test.apply(lambda x: x.lower())

for row in range(0, 240):
  if "cajoled" in lf_chg_test.iloc[row]:
    lf_chg_test.iloc[row] = "made fun of"
  elif "amount of sleep" in lf_chg_test.iloc[row]:
    lf_chg_test.iloc[row] = "reduced amount of sleep"
  elif "friends" in lf_chg_test.iloc[row]:
    lf_chg_test.iloc[row] = "reduced time of socialized with friends/family"
  elif "no" in lf_chg_test.iloc[row]:
    lf_chg_test.iloc[row] = "none"
  else:
    lf_chg_test.iloc[row] = "others (combination of the negatives, small criticism)"

lf_chg_test.value_counts()

made fun of                                               93
reduced amount of sleep                                   72
none                                                      38
reduced time of socialized with friends/family            26
others (combination of the negatives, small criticism)    11
Name: life_chg, dtype: int64

In [52]:
df["life_chg"] = lf_chg_test
df.head(10)

Unnamed: 0,fav_grp,popl_by_co_yn,reason,yr_listened,gender_pref,daily_music_hr,watch_MV_yn,daily_MV_hr,obsessed_yn,news_medium,pursuit,time_cons_yn,life_chg,pos_eff,yr_merch_spent,money_src,concert_yn,crazy_ev,age,country,job,gender,num_gr_like,bts_vs_others
0,bts,Its gaining popularity,Its different from the usual music. I like som...,1.5,"Its about the music, not the gender",5,Yes,2.5,Not sure,youtube,others (combination of the the four and visit ...,no,made fun of,Reduced stress/anxiety/depression,75,I have a full - time job,Yes,Missed days of work and school to attend concerts,18,A country in the UK,Sales,Female,1,bts
1,bts,Its gaining popularity,Its different from the usual music. I like som...,3.5,I listen to both,3,Yes,2.5,Yes,youtube,others (combination of the the four and visit ...,yes,made fun of,Made friends who are like-minded,50,Working part-time,"I want to, but the K-Pop groups don't come to ...",Listened to K-Pop without studying or working ...,18,A country in the UK,Student,Female,1,bts
2,bts,Its gaining popularity,Its different from the usual music. I like som...,4.0,"Its about the music, not the gender",3,Yes,2.5,No,"social media (twitter, instagram)",others (combination of the the four and visit ...,no,made fun of,Made friends who are like-minded,75,Working part-time,Yes,Nothing as such,18,Other European countries,Student,Female,2,bts
3,bts,Its gaining popularity,Its different from the usual music. I like som...,3.5,I listen to both,5,Yes,3.5,No,youtube,others (combination of the the four and visit ...,no,made fun of,Reduced stress/anxiety/depression,75,Working part-time,Yes,Nothing as such,24,USA,Student,Female,1,bts
4,bts,Yes,Its different from the usual music. I like som...,3.5,Yes,5,Yes,2.5,Not sure,youtube,learn the culture,yes,made fun of,Reduced stress/anxiety/depression,150,Parents,"I want to, but the K-Pop groups don't come to ...",Nothing as such,18,USA,Student,Female,1,bts
5,bts,Its gaining popularity,Its different from the usual music. I like som...,4.0,Yes,3,Yes,1.0,No,"others (reddit, tumbler, or none)",learn the culture,no,made fun of,Made friends who are like-minded,50,Working part-time,"I want to, but the K-Pop groups don't come to ...",Constantly avoid going out with friends/family...,18,USA,Student,Female,5,bts
6,bts,I'm not sure,The idols connect with their fans in a way tha...,3.5,I listen to both,5,Yes,1.0,Yes,youtube,learn korean fashion/makeup,yes,reduced amount of sleep,Made friends who are like-minded,50,Parents,"I want to, but the K-Pop groups don't come to ...",Nothing as such,12,Canada,Student,Female,1,bts
7,bts,Yes,Its different from the usual music. I like som...,4.0,I listen to both,3,Yes,1.0,Yes,youtube,others (combination of the the four and visit ...,no,none,Reduced stress/anxiety/depression,50,Parents,Yes,Nothing as such,18,USA,Student,Female,1,bts
8,other(s),Its gaining popularity,Its different from the usual music. I like som...,3.5,I listen to both,5,Yes,2.5,Not sure,youtube,others (combination of the the four and visit ...,yes,reduced amount of sleep,Made friends who are like-minded,50,Parents,"I want to, but the K-Pop groups don't come to ...",Listened to K-Pop without studying or working ...,12,A country in the UK,Student,Female,2,other(s)
9,bts,Yes,Its different from the usual music. I like som...,4.0,I listen to both,10,Yes,1.0,No,youtube,others (combination of the the four and visit ...,no,none,Reduced stress/anxiety/depression,0,Working part-time,"I want to, but the K-Pop groups don't come to ...",Nothing as such,18,USA,Student,Female,1,bts


Clean "pos_eff"

In [53]:
df['pos_eff'].value_counts()

Reduced stress/anxiety/depression                                                                                                                                                                                                                                                                                     131
Made friends who are like-minded                                                                                                                                                                                                                                                                                       77
Both                                                                                                                                                                                                                                                                                                                    7
All of the above                                          

In [54]:
# re-label positive effects
# Reduced stress/anxiety/depression
# Made friends who are like-minded
# All of the above
# others (don't know, no effect, learning new things)

pos_ef = df['pos_eff']
pos_ef_test = pos_ef

pos_ef_test = pos_ef_test.apply(lambda x: x.lower())

for row in range(0, 240):
  if "reduced stress" in pos_ef_test.iloc[row]:
    pos_ef_test.iloc[row] = "reduced stress/anxiety/depression"
  elif "like-minded" in pos_ef_test.iloc[row]:
    pos_ef_test.iloc[row] = "made friends"
  elif "both" in pos_ef_test.iloc[row]:
    pos_ef_test.iloc[row] = "both reduced stress and made friends"
  else:
    pos_ef_test.iloc[row] = "others (don't know, no effect, learning new things)"

pos_ef_test.value_counts()

reduced stress/anxiety/depression                      132
made friends                                            77
others (don't know, no effect, learning new things)     21
both reduced stress and made friends                    10
Name: pos_eff, dtype: int64

In [0]:
df['pos_eff'] = pos_ef_test

Clean "money_src"

In [56]:
df.money_src.value_counts()

Parents                                                                                                                                              99
Working part-time                                                                                                                                    76
I have a full - time job                                                                                                                             22
Borrow from friends/partner                                                                                                                           3
No one                                                                                                                                                2
I don’t                                                                                                                                               2
Savings                                                                                 

In [57]:
# re-label money source
# parents
# part-time
# full-time
# borrowed
# others (gift, scholarship)

money = df['money_src']
money_test = money

money_test = money_test.apply(lambda x: x.lower())

for row in range(0, 240):
  if "parents" in money_test.iloc[row]:
    money_test.iloc[row] = "from parents"
  elif "part-time" in money_test.iloc[row]:
    money_test.iloc[row] = "part-time job"
  elif "a full" in money_test.iloc[row]:
    money_test.iloc[row] = "full-time job"
  elif "borrow" in money_test.iloc[row]:
    money_test.iloc[row] = "borrowed"
  else:
    money_test.iloc[row] = "others (gift, scholarship, etc)"

money_test.value_counts()

from parents                       104
part-time job                       76
others (gift, scholarship, etc)     35
full-time job                       22
borrowed                             3
Name: money_src, dtype: int64

In [0]:
df['money_src'] = money_test

Clean "crazy_ev"

In [59]:
df.crazy_ev.value_counts()

Nothing as such                                                                                                                                                                                   133
Listened to K-Pop without studying or working despite having a deadline                                                                                                                            46
Missed days of work and school to attend concerts                                                                                                                                                  21
Listened to K-Pop without studying or working despite having a deadline;Missed days of work and school to attend concerts                                                                           8
Constantly avoid going out with friends/family to watch/listen K-Pop                                                                                                                                5
Listened t

In [60]:
# re-label crazy events
# nothing
# didn't study or work
# missed school or work
# others (combination of not studying/working and missing school/work, etc)

crazy = df['crazy_ev']
crazy_test = crazy

crazy_test = crazy_test.apply(lambda x: x.lower())

for row in range(0, 240):
  if "nothing" in crazy_test.iloc[row] and ";" not in crazy_test.iloc[row]:
    crazy_test.iloc[row] = "nothing"
  elif "without studying or working" in crazy_test.iloc[row] and ";" not in crazy_test.iloc[row]:
    crazy_test.iloc[row] = "didn't study or work"
  elif "Missed days" in crazy_test.iloc[row]:
    # never true as written: the text was lower-cased above, so the capital "M" cannot match
    crazy_test.iloc[row] = "missed school or work"
  else:
    crazy_test.iloc[row] = "others (combination of not studying/working and missing school/work, etc)"

crazy_test.value_counts()

nothing                                                                      133
others (combination of not studying/working and missing school/work, etc)     61
didn't study or work                                                          46
Name: crazy_ev, dtype: int64

In [0]:
df['crazy_ev'] = crazy_test

Clean "country"

In [62]:
df.country.value_counts()

USA                             140
A country in the UK              26
Other European countries         18
Canada                           15
Latin America                     8
Australia                         7
Germany                           6
France                            5
Other Asian country               5
South Africa                      2
New Zealand                       2
A country from the Caribbean      1
Turkey                            1
canada                            1
Sweden                            1
CANNNAADAAAA BOIIIIIIS            1
Finland                           1
Name: country, dtype: int64

In [63]:
# relabel
# usa, uk, other european, canada (can), latin america, australia, germany, other asian, france, others
con = df['country']
con_test = con

con_test = con_test.apply(lambda x: x.lower())

for row in range(0, 240):
  if "usa" in con_test.iloc[row]:
    con_test.iloc[row] = "usa"
  elif "uk" in con_test.iloc[row]:
    con_test.iloc[row] = "uk"
  elif "european" in con_test.iloc[row]:
    con_test.iloc[row] = "other european countries"
  elif "can" in con_test.iloc[row]:
    con_test.iloc[row] = "canada"
  elif "latin" in con_test.iloc[row]:
    con_test.iloc[row] = "latin america"
  elif "australia" in con_test.iloc[row]:
    con_test.iloc[row] = "australia"
  elif "germany" in con_test.iloc[row]:
    con_test.iloc[row] = "germany"
  elif "asian" in con_test.iloc[row]:
    con_test.iloc[row] = "other asian countries"
  elif "france" in con_test.iloc[row]:
    con_test.iloc[row] = "france"
  else:
    con_test.iloc[row] = "others (south africa, new zealand, sweden, finland, turkey, caribbean)"

con_test.value_counts()

usa                                                                       140
uk                                                                         26
other european countries                                                   18
canada                                                                     17
latin america                                                               8
others (south africa, new zealand, sweden, finland, turkey, caribbean)      8
australia                                                                   7
germany                                                                     6
france                                                                      5
other asian countries                                                       5
Name: country, dtype: int64
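
The same ordered rules can also be written without a row loop. A minimal sketch using `Series.str.contains` together with `numpy.select`, where the first matching condition wins, just as in the `if`/`elif` chain. The sample values and the names `rules`, `others` and `cleaned` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# made-up values standing in for df['country'], lower-cased first
sample = pd.Series([
    "USA",
    "A country in the UK",
    "canada",
    "CANNNAADAAAA BOIIIIIIS",
    "Finland",
]).str.lower()

# (substring, label) pairs in the same order as the if/elif chain
rules = [
    ("usa", "usa"),
    ("uk", "uk"),
    ("european", "other european countries"),
    ("can", "canada"),
    ("latin", "latin america"),
    ("australia", "australia"),
    ("germany", "germany"),
    ("asian", "other asian countries"),
    ("france", "france"),
]
others = "others (south africa, new zealand, sweden, finland, turkey, caribbean)"

# one boolean array per rule; regex=False treats the needle as plain text
conditions = [sample.str.contains(needle, regex=False).to_numpy() for needle, _ in rules]
labels = [label for _, label in rules]

# np.select picks the label of the first condition that is True for each row
cleaned = pd.Series(np.select(conditions, labels, default=others), index=sample.index)
print(cleaned.tolist())
```

Keeping the rules in a list makes their order explicit, which matters because a short needle such as "can" or "uk" may also occur inside longer answers.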

In [64]:
df['country'] = con_test
df.head()

Unnamed: 0,fav_grp,popl_by_co_yn,reason,yr_listened,gender_pref,daily_music_hr,watch_MV_yn,daily_MV_hr,obsessed_yn,news_medium,pursuit,time_cons_yn,life_chg,pos_eff,yr_merch_spent,money_src,concert_yn,crazy_ev,age,country,job,gender,num_gr_like,bts_vs_others
0,bts,Its gaining popularity,Its different from the usual music. I like som...,1.5,"Its about the music, not the gender",5,Yes,2.5,Not sure,youtube,others (combination of the the four and visit ...,no,made fun of,reduced stress/anxiety/depression,75,full-time job,Yes,others (combination of not studying/working an...,18,uk,Sales,Female,1,bts
1,bts,Its gaining popularity,Its different from the usual music. I like som...,3.5,I listen to both,3,Yes,2.5,Yes,youtube,others (combination of the the four and visit ...,yes,made fun of,made friends,50,part-time job,"I want to, but the K-Pop groups don't come to ...",others (combination of not studying/working an...,18,uk,Student,Female,1,bts
2,bts,Its gaining popularity,Its different from the usual music. I like som...,4.0,"Its about the music, not the gender",3,Yes,2.5,No,"social media (twitter, instagram)",others (combination of the the four and visit ...,no,made fun of,made friends,75,part-time job,Yes,nothing,18,other european countries,Student,Female,2,bts
3,bts,Its gaining popularity,Its different from the usual music. I like som...,3.5,I listen to both,5,Yes,3.5,No,youtube,others (combination of the the four and visit ...,no,made fun of,reduced stress/anxiety/depression,75,part-time job,Yes,nothing,24,usa,Student,Female,1,bts
4,bts,Yes,Its different from the usual music. I like som...,3.5,Yes,5,Yes,2.5,Not sure,youtube,learn the culture,yes,made fun of,reduced stress/anxiety/depression,150,from parents,"I want to, but the K-Pop groups don't come to ...",nothing,18,usa,Student,Female,1,bts


In [65]:
# change the entire dataframe to lower case
df = df.apply(lambda x: x.astype(str).str.lower())

df.head()

Unnamed: 0,fav_grp,popl_by_co_yn,reason,yr_listened,gender_pref,daily_music_hr,watch_MV_yn,daily_MV_hr,obsessed_yn,news_medium,pursuit,time_cons_yn,life_chg,pos_eff,yr_merch_spent,money_src,concert_yn,crazy_ev,age,country,job,gender,num_gr_like,bts_vs_others
0,bts,its gaining popularity,its different from the usual music. i like som...,1.5,"its about the music, not the gender",5,yes,2.5,not sure,youtube,others (combination of the the four and visit ...,no,made fun of,reduced stress/anxiety/depression,75,full-time job,yes,others (combination of not studying/working an...,18,uk,sales,female,1,bts
1,bts,its gaining popularity,its different from the usual music. i like som...,3.5,i listen to both,3,yes,2.5,yes,youtube,others (combination of the the four and visit ...,yes,made fun of,made friends,50,part-time job,"i want to, but the k-pop groups don't come to ...",others (combination of not studying/working an...,18,uk,student,female,1,bts
2,bts,its gaining popularity,its different from the usual music. i like som...,4.0,"its about the music, not the gender",3,yes,2.5,no,"social media (twitter, instagram)",others (combination of the the four and visit ...,no,made fun of,made friends,75,part-time job,yes,nothing,18,other european countries,student,female,2,bts
3,bts,its gaining popularity,its different from the usual music. i like som...,3.5,i listen to both,5,yes,3.5,no,youtube,others (combination of the the four and visit ...,no,made fun of,reduced stress/anxiety/depression,75,part-time job,yes,nothing,24,usa,student,female,1,bts
4,bts,yes,its different from the usual music. i like som...,3.5,yes,5,yes,2.5,not sure,youtube,learn the culture,yes,made fun of,reduced stress/anxiety/depression,150,from parents,"i want to, but the k-pop groups don't come to ...",nothing,18,usa,student,female,1,bts
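
Note that `astype(str)` turns every column into strings, including ones that look numeric such as `age`. If numeric dtypes should be kept for later steps in the same session, a sketch that lower-cases only the text columns could look like this. The tiny frame `demo` and the name `text_cols` are made up for illustration.

```python
import pandas as pd

# made-up frame with one text column and one numeric column
demo = pd.DataFrame({"job": ["Sales", "Student"], "age": [18, 24]})

# collect the columns that hold strings; numeric columns are skipped
text_cols = [c for c in demo.columns if pd.api.types.is_string_dtype(demo[c])]

# lower-case only those columns, leaving the numeric dtype of "age" untouched
demo[text_cols] = demo[text_cols].apply(lambda col: col.str.lower())

print(demo.dtypes)
```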


In [0]:
# save the cleaned dataframe to csv
df.to_csv("cleaned kpop data.csv", index = False)