**TOPIC: Study the relationship between social media platforms and music tastes**<br>
**Questions**<br>
**1. What data do you have?**<br>
We have each participant's age, gender, daily music consumption (hours per day), preferred social media platforms, favorite music genre, listening occasions, and preferred music streaming platform.<br><br>
**2. What would you like to know?**<br>
Is there an association between particular social media platforms and music genres? Among participants who listen to music on a given occasion (for example, while doing chores), what are their favorite music genres and music platforms? Can preferred social media platforms be used to predict daily music consumption?<br><br>
**3. Explore the data.**<br>
Generate statistics and perform visualizations. Explain what you are computing (mean, SD, ...), and then
compute using Python. Create some visualizations (at least 8, and at least 6
must be of different types); use Python.<br><br>
**4. Can you state any hypotheses or make predictions? Which tests can you apply to
verify your hypothesis? State clearly each of your hypotheses (at least 3).**<br>
answer<br><br>
**5. Test your hypotheses.**<br>
Test your hypotheses and predictions (use at least 2 different
tests). For each: i) describe the test you are using; ii) perform it; iii) analyze
the results and draw the conclusion. You must perform correlation analysis and chi-squared test.
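
As a reference for the chi-squared requirement, here is a minimal sketch of a test of independence between two categorical variables. The `toy` DataFrame, its column names, and the 0.05 significance level are illustrative assumptions, not the survey data or the notebook's final analysis.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical toy responses (NOT the survey data)
toy = pd.DataFrame({
    "platform": ["TikTok"] * 4 + ["Reddit"] * 4,
    "genre": ["Pop", "Pop", "Pop", "Rock", "Rock", "Rock", "Rock", "Pop"],
})

# Contingency table of observed counts: rows = platform, columns = genre
observed = pd.crosstab(toy["platform"], toy["genre"])

# H0: platform and genre are independent.
# For a 2x2 table, SciPy applies Yates' continuity correction by default.
stat, p_value, dof, expected = chi2_contingency(observed)

alpha = 0.05  # assumed significance level
print(observed)
print(f"chi2={stat:.3f}, p={p_value:.3f}, dof={dof}")
print("reject H0" if p_value < alpha else "fail to reject H0")
```

The `expected` array is worth inspecting as well: a common rule of thumb is that the chi-squared approximation becomes unreliable when many expected counts fall below 5, which can happen with sparse categories.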


In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import waterfall_chart as wfc
from scipy.stats import chi2_contingency
from scipy.stats import chi2

df = pd.read_csv("CS105 W23 Survey Responses - Form Responses 1.csv")
# df.head()

# Data cleaning
# Survey columns to keep:
# 1. How old are you?
# 2. To which gender identity do you most identify?
# 62. On average, how many hours do you listen to music per day?
# 68. Which social media platforms do you use most often?
# 69. What is your favorite genre of music?
# 70. In which of the following situations do you listen to music?
# 71. Which music streaming service do you use the most?

# #create empty dataframe
# df2 = pd.DataFrame()
# #append columns
# col1 = df["1. How old are you?"]
# col68 = df["68. Which social media platforms do you use most often?"] 
# col69 = df["69. What is your favorite genre of music?"]
# col71 = df["71. Which music streaming service do you use the most?"]
# col62 = df["62. On average, how many hours do you listen to music per day?"]
# col70 = df["70. In which of the following situations do you listen to music?"]

# df2.insert(0, "1. How old are you?", col1)
# df2 = df2.join(col68)
# df2 = df2.join(col69)
# df2 = df2.join(col71)
# df2 = df2.join(col62)
# df2 = df2.join(col70)

# df2

#remove outliers, fill/remove corrupt/empty values

In [2]:
#create new table
df2 = df[['1. How old are you?',
          '2. To which gender identity do you most identify?',
          '62. On average, how many hours do you listen to music per day?', 
          '68. Which social media platforms do you use most often?', 
          '69. What is your favorite genre of music?', 
          '70. In which of the following situations do you listen to music?', 
          '71. Which music streaming service do you use the most?']]

#rename columns
df2 = df2.rename(columns={'1. How old are you?': 'Age', 
                          '2. To which gender identity do you most identify?': 'Gender', 
                          '62. On average, how many hours do you listen to music per day?': 'Daily Music Consumption', 
                          '68. Which social media platforms do you use most often?': 'Preferred Social Media Platform', 
                          '69. What is your favorite genre of music?': 'Music Genre', 
                          '70. In which of the following situations do you listen to music?': 'Occasion', 
                          '71. Which music streaming service do you use the most?': 'Preferred Music Platform'})

df2

Unnamed: 0,Age,Gender,Daily Music Consumption,Preferred Social Media Platform,Music Genre,Occasion,Preferred Music Platform
0,21,Man,4.0,"Reddit, Twitter, Youtube",Hip Hop / Rap,"Doing chores, Driving, Eating, Exercising, Rel...",Spotify
1,21,Woman,4.0,"Instagram, Youtube",Rock,"Relaxing, Studying / Doing homework",Spotify
2,22,Man,2.0,"Reddit, TikTok, Youtube",Anime / Jpop,"Driving, Exercising, Studying / Doing homework",Spotify
3,20,Woman,1.0,"Instagram, Reddit, TikTok",Pop,"Doing chores, Driving",Spotify
4,21,Man,2.0,"Reddit, Youtube",Pop,"Doing chores, Exercising, Studying / Doing hom...",Spotify
...,...,...,...,...,...,...,...
219,21,Woman,4.0,"Instagram, TikTok, Twitter",Indie / Alternative,"Doing chores, Driving, Studying / Doing homework",Spotify
220,20,Man,2.0,,Salsa music,Relaxing,Spotify
221,20,Man,20.0,"Instagram, Snapchat, TikTok",Hip Hop / Rap,"Driving, Eating, Exercising, Relaxing, Studyin...",Spotify
222,20,Woman,4.0,"Instagram, TikTok, Twitter",Indie / Alternative,"Doing chores, Driving, Eating, Exercising, Rel...",Spotify


In [3]:
#preliminary data
# display(df2.groupby("Preferred Social Media Platform")["Age"].count())


In [4]:
display(df2.groupby("Music Genre")["Age"].count())


Music Genre
Anime / Jpop                                                        19
Bladee                                                               1
Breakcore                                                            1
Classical                                                            6
Electronic                                                          18
Funk                                                                 1
Hip Hop / Rap                                                       49
House                                                                1
I have no favorite                                                   1
Indie / Alternative                                                 32
Kpop                                                                10
Lofi                                                                 1
Metal                                                                4
Most genres like anime, jpop, edm, hip hop, kpop, pop, r&b, lofi 

In [5]:
# display(df2.groupby("Occasion")["Age"].count())


In [6]:
# display(df2.groupby("Preferred Music Platform")["Age"].count())


In [7]:
display(df2.groupby("Gender")["Age"].count())


Gender
Man                      169
Non-binary                 2
Prefer not to respond      2
Woman                     49
Name: Age, dtype: int64

In [8]:
# display(df2.groupby("Daily Music Consumption")["Age"].count())

In [9]:
# Consolidate free-text genre responses into the standard genre categories.
# Keys must match the raw responses exactly (including trailing spaces).
genre_map = {
    "Breakcore": "Electronic",
    "Bladee": "Hip Hop / Rap",
    "Funk": "Pop",
    "House": "Electronic",
    "I have no favorite": "No preference",
    "Lofi": "Other",
    "Most genres like anime, jpop, edm, hip hop, kpop, pop, r&b, lofi": "Anime / Jpop",
    "Musical Theater": "Other",
    "Phonk": "Electronic",
    "Salsa music": "Other",
    "Soundtrack ": "Other",
    "Video Game": "Other",
    "any, whatever sounds good to my ear": "No preference",
    "electronic pop ": "Electronic",
    "idk": "No preference",
    "indie pop": "Indie / Alternative",
    "multi": "Other",
}
df2["Music Genre"] = df2["Music Genre"].replace(genre_map)

display(df2.groupby("Music Genre")["Age"].count())


Music Genre
Anime / Jpop           20
Classical               6
Electronic             22
Hip Hop / Rap          50
Indie / Alternative    33
Kpop                   10
Metal                   4
No preference           4
Other                   6
Pop                    32
R&B / Soul / Blues     18
Rock                   11
Name: Age, dtype: int64

In [10]:
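# Collapse every response other than 'Man' or 'Woman' into 'Other'.
# Note: missing (NaN) responses are relabeled too, because isin() returns False for NaN.
df2.loc[~df2['Gender'].isin(['Man', 'Woman']), 'Gender'] = 'Other'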

display(df2.groupby("Gender")["Age"].count())

Gender
Man      169
Other      6
Woman     49
Name: Age, dtype: int64
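
The 'Other' count of 6 above is larger than the 2 + 2 non-binary and prefer-not-to-respond answers counted earlier. A likely explanation is that rows with a missing gender are relabeled as well, since `isin` returns False for NaN and the `~` then turns it into True. A small sketch on made-up values (not the survey data) shows the behavior, along with one hypothetical way to leave missing answers untouched:

```python
import numpy as np
import pandas as pd

# Hypothetical responses, including one missing answer
gender = pd.Series(["Man", "Woman", "Non-binary", np.nan])

# Same mask as the notebook: NaN is "not in the list", so it is selected too
is_other = ~gender.isin(["Man", "Woman"])
print(is_other.tolist())

# Alternative mask (an assumption, not the notebook's choice): skip missing answers
is_other_known = is_other & gender.notna()
print(is_other_known.tolist())
```

Whether missing answers belong in 'Other' is a judgment call; the point is only that the mask decides it implicitly.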

In [11]:
display(df2.groupby("Daily Music Consumption")["Age"].count())
# print(df2.dtypes)

Daily Music Consumption
0.0     12
1.0     40
2.0     35
3.0     22
4.0     34
5.0     14
6.0     14
7.0      5
8.0     13
9.0      1
10.0     6
12.0     7
13.0     1
14.0     1
15.0     1
18.0     1
20.0     3
24.0     3
25.0     1
30.0     1
40.0     1
50.0     2
Name: Age, dtype: int64
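
Several of the counts above are for answers over 24 hours per day, which cannot be literal daily listening time. Such values inflate the mean, so they matter if the mean is later used to fill missing answers. The following is a sketch of one alternative, shown on made-up numbers: treat out-of-range answers as missing and fill with the median. The sample values and the 0 to 24 range check are assumptions for illustration, not the approach taken in this notebook.

```python
import pandas as pd

# Hypothetical hours-per-day answers (NOT the survey data); None marks a missing answer
hours = pd.Series([1, 2, 4, None, 30, 50])

# Keep only plausible answers; anything outside 0..24 (or missing) becomes NaN
valid = hours.where(hours.between(0, 24))

# The median is less sensitive to extreme answers than the mean
fill_value = valid.median()
cleaned = valid.fillna(fill_value)

print("mean of raw answers:", hours.mean())
print("median of plausible answers:", fill_value)
print(cleaned.tolist())
```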

In [12]:
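# Fill missing values with the mean of the raw responses, truncated to an integer.
# Note: this mean includes the implausible answers above 24 hours, which pull it upward.
avgHours = df.loc[:, '62. On average, how many hours do you listen to music per day?'].mean()
# display(int(avgHours))

df2["Daily Music Consumption"] = df2["Daily Music Consumption"].fillna(int(avgHours))
# print(df2.dtypes)

# Cap everything above 9 hours at 10 so it falls into the '10+' bin below
df2.loc[df2['Daily Music Consumption'] > 9.0, 'Daily Music Consumption'] = 10

# Convert to strings, then map each value to a two-hour bin
df2["Daily Music Consumption"] = df2["Daily Music Consumption"].astype(str)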
df2.loc[df2['Daily Music Consumption'] == "0.0", 'Daily Music Consumption'] = '0-1'
df2.loc[df2['Daily Music Consumption'] == "1.0", 'Daily Music Consumption'] = '0-1'
df2.loc[df2['Daily Music Consumption'] == "2.0", 'Daily Music Consumption'] = '2-3'
df2.loc[df2['Daily Music Consumption'] == "3.0", 'Daily Music Consumption'] = '2-3'
df2.loc[df2['Daily Music Consumption'] == "4.0", 'Daily Music Consumption'] = '4-5'
df2.loc[df2['Daily Music Consumption'] == "5.0", 'Daily Music Consumption'] = '4-5'
df2.loc[df2['Daily Music Consumption'] == "6.0", 'Daily Music Consumption'] = '6-7'
df2.loc[df2['Daily Music Consumption'] == "7.0", 'Daily Music Consumption'] = '6-7'
df2.loc[df2['Daily Music Consumption'] == "8.0", 'Daily Music Consumption'] = '8-9'
df2.loc[df2['Daily Music Consumption'] == "9.0", 'Daily Music Consumption'] = '8-9'
df2.loc[df2['Daily Music Consumption'] == "10.0", 'Daily Music Consumption'] = '10+'


display(df2.groupby("Daily Music Consumption")["Age"].count())

Daily Music Consumption
0-1    52
10+    28
2-3    57
4-5    54
6-7    19
8-9    14
Name: Age, dtype: int64
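
The same two-hour bins can also be produced with `pd.cut`, which returns an ordered categorical so that '10+' sorts last instead of between '0-1' and '2-3' as in the output above. This is a minimal sketch on made-up values; the `hours`, `edges`, and `labels` names are illustrative and not part of the notebook.

```python
import pandas as pd

# Hypothetical hours-per-day answers (NOT the survey data)
hours = pd.Series([0, 1, 2.5, 4, 7, 9, 12, 50])

# With right=False the bins are left-inclusive: [0, 2), [2, 4), ..., [10, inf)
edges = [0, 2, 4, 6, 8, 10, float("inf")]
labels = ["0-1", "2-3", "4-5", "6-7", "8-9", "10+"]

binned = pd.cut(hours, bins=edges, labels=labels, right=False)

# sort=False keeps the category order rather than sorting by frequency
print(binned.value_counts(sort=False))
```

Unlike matching on the strings "0.0", "1.0", and so on, this also assigns fractional answers such as 2.5 to a bin.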

In [13]:
# df = df[df['Daily Music Consumption'] <= 24]
# def categorize_hours(hours):
#     if hours <= 2:
#         return '0-2'
#     elif hours <= 5:
#         return '3-5'
#     elif hours <= 8:
#         return '6-8'
#     else:
#         return '8+'

# df['Hours Cohort'] = df['Daily Music Consumption'].apply(categorize_hours)
# df['Hours Cohort']

In [14]:
# df2.loc[df2['Age'] < 19, 'Age'] = 18
# Cap ages above 24 at 25 so they fall into the '25+' bin below
df2.loc[df2['Age'] > 24, 'Age'] = 25

df2["Age"] = df2["Age"].astype(str)
df2.loc[df2['Age'] == "17", 'Age'] = '17-18'
df2.loc[df2['Age'] == "18", 'Age'] = '17-18'
df2.loc[df2['Age'] == "19", 'Age'] = '19-20'
df2.loc[df2['Age'] == "20", 'Age'] = '19-20'
df2.loc[df2['Age'] == "21", 'Age'] = '21-22'
df2.loc[df2['Age'] == "22", 'Age'] = '21-22'
df2.loc[df2['Age'] == "23", 'Age'] = '23-24'
df2.loc[df2['Age'] == "24", 'Age'] = '23-24'
df2.loc[df2['Age'] == "25", 'Age'] = '25+'



display(df2.groupby("Age")["Gender"].count())

Age
17-18      2
19-20    108
21-22     71
23-24     21
25+       22
Name: Gender, dtype: int64

In [15]:
display(df2.groupby("Preferred Social Media Platform")["Age"].count())


Preferred Social Media Platform
BiliBili                         1
Facebook, Instagram              1
Facebook, Instagram, Youtube     1
Facebook, Reddit, TikTok         1
Facebook, Youtube                1
Instagram                       10
Instagram, Reddit                2
Instagram, Reddit, Snapchat      4
Instagram, Reddit, TikTok        4
Instagram, Reddit, Twitter       4
Instagram, Reddit, Youtube      16
Instagram, Snapchat              2
Instagram, Snapchat, TikTok     11
Instagram, Snapchat, Twitter     4
Instagram, Snapchat, Youtube     7
Instagram, TikTok                6
Instagram, TikTok, Safari        1
Instagram, TikTok, Twitter      16
Instagram, TikTok, Youtube      21
Instagram, Twitch                1
Instagram, Twitter               1
Instagram, Twitter, Youtube      9
Instagram, Wechat, BiliBili      1
Instagram, Youtube              20
Reddit                           3
Reddit, Snapchat, TikTok         1
Reddit, TikTok                   1
Reddit, TikTok, Twitter
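
Because each response above is a comma-separated list of platforms, per-platform totals require splitting every entry first. Below is a minimal sketch using `collections.Counter` on made-up responses. The `KNOWN` set, the `count_platforms` helper, and the `sample` list are hypothetical names for illustration, not the notebook's own code.

```python
from collections import Counter

# Platforms tallied individually; any other name is counted as "Other"
KNOWN = {"Facebook", "Instagram", "Reddit", "Snapchat", "TikTok", "Twitter", "Youtube"}

def count_platforms(responses):
    """Count selections per platform from comma-separated multi-select answers."""
    counts = Counter()
    for response in responses:
        if not isinstance(response, str):  # skip missing answers (None / NaN)
            continue
        for name in response.split(","):
            name = name.strip()
            if name:
                counts[name if name in KNOWN else "Other"] += 1
    return counts

# Hypothetical responses (NOT the survey data)
sample = ["Instagram, Youtube", "Reddit, TikTok, Youtube", "BiliBili", None]
print(count_platforms(sample))
```

In the notebook, the same helper could presumably be applied to `df2["Preferred Social Media Platform"]`. A respondent who lists three platforms contributes three counts, so the totals sum to more than the number of respondents.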

In [17]:
# Count how many respondents selected each social media platform
facebook = 0
instagram = 0
reddit = 0
snapchat = 0
tiktok = 0
twitter = 0
youtube = 0
other = 0
