## Exploratory Analysis of Skincare Subreddits

For this exploratory analysis, we are looking at skincare trends through Reddit. Reddit has various skincare communities; when searching "Skincare on Reddit", these are the top 3 subreddits:

1. r/SkincareAddiction with 4.3m members
2. r/AsianBeauty with 2.9m members
3. r/30PlusSkinCare with 2.1m members

Additional, country-specific subreddits:
1. r/SkincareAddictionUK with 484k members
2. r/IndianSkincareAddicts with 242k members
3. r/AusSkincare with 177k members

There are other large subreddits such as:
1. r/Skincare_Addiction with 1.8m members
2. r/SkincareAddicts with 1m members

However, these two subreddits are likely spin-offs of r/SkincareAddiction. While the rest of the subreddits target a more niche group (Asian brands, over 30s, UK, India, Australia), the demographic of these two is likely similar to that of r/SkincareAddiction, so they will not be used.

Exploratory data analysis will be carried out on the six subreddits listed above. We are going to first explore the top posts of each subreddit. The analysis is planned in three parts:

1. A table of the top 100 posts.
2. A table of how often certain ingredients have been referenced across the subreddits.
3. A table of the top posts which reference these ingredients in the title.

### Step 1: Crawling a real world data set 
#### Step 1.1 Importing necessary functions and setting up PRAW

In [4]:
#import necessary functions (as needed so far)
import datetime 
import praw #reddit crawler
import pandas as pd

#things needed to use PRAW
#the real client ID and secret are replaced with placeholders; keep actual keys out of shared notebooks
client_id = 'YOUR_CLIENT_ID'
client_secret = 'YOUR_CLIENT_SECRET'
user_agent = 'cryinginpython98'

reddit = praw.Reddit(client_id=client_id,client_secret=client_secret,user_agent=user_agent)

#create a list of subreddits 
#create empty list for the posts 
#loop through - take title, body, upvotes, comment, created 
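Hard-coding the client ID and secret in a notebook makes them easy to leak when the notebook is shared. A minimal sketch of one alternative is to read them from environment variables. The variable names (`REDDIT_CLIENT_ID`, `REDDIT_CLIENT_SECRET`, `REDDIT_USER_AGENT`) and the helper `load_reddit_credentials` are assumptions made for illustration and are not part of PRAW.

```python
import os

def load_reddit_credentials(env=os.environ):
    """Read the three PRAW settings from environment variables (names are illustrative)."""
    names = {
        'client_id': 'REDDIT_CLIENT_ID',
        'client_secret': 'REDDIT_CLIENT_SECRET',
        'user_agent': 'REDDIT_USER_AGENT',
    }
    # Fail early with a clear message if any variable is not set
    missing = [var for var in names.values() if var not in env]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return {key: env[var] for key, var in names.items()}

# Demonstration with a stand-in mapping instead of the real environment
demo_env = {
    'REDDIT_CLIENT_ID': 'demo-id',
    'REDDIT_CLIENT_SECRET': 'demo-secret',
    'REDDIT_USER_AGENT': 'demo-agent',
}
credentials = load_reddit_credentials(demo_env)
print(sorted(credentials))
```

In the notebook, the returned dictionary could then be passed on as `praw.Reddit(**credentials)`, since its keys match the keyword arguments used above.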

#### Step 1.2 Checking if reddit API Key is working
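Expected output: `True`. This confirms that the `reddit` instance was created in read-only mode, which is the mode PRAW uses when only a client ID, client secret and user agent are supplied.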

In [6]:
print(reddit.read_only) #check if it is working, needs to output == True

True


#### Step 1.3: Selecting Subreddits 
This exploratory analysis looks at 6 different skincare subreddits. As mentioned above, these are a general skincare subreddit (which is also the most popular), plus more niche subreddits targeted at people who like specific brands (Asian Beauty), people from different countries (UK, Aus/NZ, India), or a different age group (>30).

In [8]:
#input all the subreddits we are looking at
subreddit1 = reddit.subreddit('SkincareAddiction')
subreddit2 = reddit.subreddit('AsianBeauty')
subreddit3 = reddit.subreddit('30PlusSkinCare')
subreddit4 = reddit.subreddit('SkincareAddictionUK')
subreddit5 = reddit.subreddit('IndianSkincareAddicts')
subreddit6 = reddit.subreddit('AusSkincare')
#create a list to loop through
subreddits = [subreddit1,subreddit2,subreddit3,subreddit4,subreddit5,subreddit6]

#Displaying the subreddits, also as a check to see if the subreddits are being called correctly
for subreddit in subreddits:
    # Display the name of the Subreddit
    print("Display Name:", subreddit.display_name)
    # Display the title of the Subreddit
    print("Title:", subreddit.title)

Display Name: SkincareAddiction
Title: For anything and everything having to do with skincare!
Display Name: AsianBeauty
Title: AsianBeauty
Display Name: 30PlusSkinCare
Title: Skin care for people over 30
Display Name: SkincareAddictionUK
Title: A UK-centric skincare subreddit.
Display Name: IndianSkincareAddicts
Title: IndianSkincareAddicts
Display Name: AusSkincare
Title: Australian & New Zealand Skincare


#### Step 1.4 Narrowing the Search to Ingredients and Skin Concerns
Initial runs of this exploratory analysis looked only at the top posts. However, due to the casual nature of these forums, many of the top posts are jokes (memes), which was not conducive to examining the skincare side of the skincare subreddits. Hence, we are going to look at specific ingredients and skin concerns. Popular or trending skincare ingredients were identified with a Google search, with a couple of ingredients added manually based on our own knowledge. Similarly, common skin concerns were identified with a Google search.

#### Step 1.5 Defining Skin Concerns and Ingredients

In [20]:
#Ingredients that will be searched for
ing = [
    'retinol', 'vitamin c', 'hyaluronic', 'niacinamide', 'salicylic',
    'benzoyl peroxide', 'glycerin', 'peptide', 'ceramide',
    'bakuchiol', 'vitamin e', 'glycolic', 'AHA', 'BHA', 'PHA', 
    'squalene', 'jojoba', 'azelaic', 'hydroquinone', 'lactic','SPF'
]
#concerns to be looked at
concerns = [
    'acne', 'dry', 'dull', 'redness', 'dark circles', 'eye bags', 
    'wrinkle', 'aging', 'uneven', 'rough', 'hyperpigmentation', 'sunscreen'
]

In [16]:
def get_date(submission):
    '''Convert a submission's Unix timestamp into a readable date string (local time)'''
    time = submission.created
    return datetime.datetime.fromtimestamp(time).strftime('%Y-%m-%d %H:%M:%S')
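Note that `datetime.datetime.fromtimestamp` without a `tz` argument converts to the local timezone of the machine running the notebook, so the same post can get a different date string on a different machine. A minimal sketch of a timezone-independent variant is shown here; the helper name `to_utc_string` is illustrative, and it takes a plain Unix timestamp so that it can be run without a Reddit connection.

```python
import datetime

def to_utc_string(timestamp):
    """Format a Unix timestamp (seconds since 1970-01-01 UTC) as a UTC date string."""
    moment = datetime.datetime.fromtimestamp(timestamp, tz=datetime.timezone.utc)
    return moment.strftime('%Y-%m-%d %H:%M:%S')

print(to_utc_string(0))      # 1970-01-01 00:00:00
print(to_utc_string(86400))  # 1970-01-02 00:00:00
```

PRAW submissions also expose a `created_utc` attribute, which could be passed to a helper like this one.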

In [18]:
#initialize list that will hold all the data scraped
info = []

In [24]:
#loop through each of the subreddits
for subreddit in subreddits:
    #loop through the top 1000 posts
    for sub in subreddit.top(limit=1000):
        #lowercase for search
        title = sub.title.lower()  
        body = sub.selftext.lower()

        #default 
        ing_pres = None
        concern_pres = None 

        #loop for each ingredient in the ing list
        for ingredient in ing:
            #if exists in the posts title or body
            if ingredient in title or ingredient in body:
                #ingredient present, assign value
                ing_pres = ingredient
    
        #loop for each concern in concern list
        for concern in concerns:
            #if exists in posts title or body
            if concern in title or concern in body:
                concern_pres = concern

        #call function to change the date to something useable
        sub_date = get_date(sub)

        #now the main thing,
        #if the post contains a mention of what we want (ingredient or concern)
        #build a dictionary for it
        if ing_pres or concern_pres:
            sub_data = {
                'subreddit': subreddit.display_name,
                'title': sub.title,
                'body': sub.selftext,
                'upvotes': sub.score,
                'num_comments': sub.num_comments,
                'url': sub.url,
                'ingredient' : ing_pres,
                'concern': concern_pres,
                'date': sub_date
            }
            #append to list
            info.append(sub_data)

print(len(info)) #see what data collected 

2105
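Three properties of the matching loop above are worth keeping in mind when reading the results. The text is lowercased before matching, so the uppercase entries in `ing` ('AHA', 'BHA', 'PHA', 'SPF') can never match. Plain substring checks also match inside longer words (for example 'rough' inside 'through'). Finally, only the last matching term in each list is kept. A minimal sketch of an alternative matcher follows; the helper `find_terms` is hypothetical, was not used to produce the data shown here, and would change the row counts if it were.

```python
import re

def find_terms(text, terms):
    """Return every term found in text as a whole word or phrase, ignoring case."""
    found = []
    for term in terms:
        # \b requires a word boundary on both sides; s? also accepts a simple plural
        pattern = r'\b' + re.escape(term) + r's?\b'
        if re.search(pattern, text, flags=re.IGNORECASE):
            found.append(term)
    return found

title = "Went through a rough patch, now using SPF 50 and retinol"
print(find_terms(title, ['retinol', 'SPF', 'AHA', 'rough']))
print(find_terms("Thoroughly tested", ['rough']))
print(find_terms("fine lines and wrinkles", ['wrinkle']))
```

The trade-off is that whole-word matching misses terms fused to other characters, such as "SPF50", so the term lists would need reviewing before switching approaches.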


#### Step 1.6 Loading the Scraped Data into a DataFrame

In [26]:
#create pandas dataframe
df=pd.DataFrame(info)
#see whats inside
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2105 entries, 0 to 2104
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   subreddit     2105 non-null   object
 1   title         2105 non-null   object
 2   body          2105 non-null   object
 3   upvotes       2105 non-null   int64 
 4   num_comments  2105 non-null   int64 
 5   url           2105 non-null   object
 6   ingredient    716 non-null    object
 7   concern       1849 non-null   object
 8   date          2105 non-null   object
dtypes: int64(2), object(7)
memory usage: 148.1+ KB


In [28]:
df.head(10) #see whats inside

Unnamed: 0,subreddit,title,body,upvotes,num_comments,url,ingredient,concern,date
0,SkincareAddiction,Posted here over a month ago showing how [acne...,,17351,257,https://i.redd.it/dtm3c3p277z41.jpg,,acne,2020-05-16 22:50:03
1,SkincareAddiction,[Anti-Aging] I may have used too much retinol ...,,15803,115,https://i.redd.it/r8g7c71mti3a1.jpg,retinol,sunscreen,2022-12-02 17:36:48
2,SkincareAddiction,[Selfie] 2 year transformation and glow up. Cy...,,11643,287,https://i.redd.it/8l4z6jzyogb51.jpg,,acne,2020-07-17 19:36:02
3,SkincareAddiction,[Personal] My Mother at 53 years old. She's th...,,11231,272,http://i.imgur.com/Ph4JiDD.jpg,,sunscreen,2016-10-16 22:19:03
4,SkincareAddiction,"[B&A] [Selfie] 3 microneedling sessions, 1 las...",,11150,340,https://i.redd.it/ruermpk6cwg31.jpg,hyaluronic,,2019-08-17 00:35:56
5,SkincareAddiction,[Before&After] Finding the right dermatologist...,,11140,484,https://www.reddit.com/gallery/nn7th5,,sunscreen,2021-05-28 22:13:12
6,SkincareAddiction,[PSA] SKIN CARE FOR PROTESTERS,\nFOR PEPPER SPRAY: \n\n-Don’t touch the expos...,10865,333,https://www.reddit.com/r/SkincareAddiction/com...,,sunscreen,2020-06-02 18:15:31
7,SkincareAddiction,Puberty is making [Acne] hit hard but we’re tr...,,10508,435,https://i.redd.it/o74bastn0iq41.jpg,,rough,2020-04-03 01:36:29
8,SkincareAddiction,[B&A] I posted my acne scar treatment progress...,,10310,350,https://i.redd.it/g7dlglcqhn011.jpg,,acne,2018-05-28 20:47:06
9,SkincareAddiction,[Acne] One year apart ✨,,9529,259,https://www.reddit.com/gallery/ltlqvl,,acne,2021-02-27 10:56:48


In [30]:
df.tail(10) #see whats inside from behind

Unnamed: 0,subreddit,title,body,upvotes,num_comments,url,ingredient,concern,date
2095,AusSkincare,Thoughts on reuseable silicone eye gels?,What the title says - I’ve been looking into g...,42,19,https://i.redd.it/oc22e1t7kjya1.jpg,,redness,2023-05-08 01:13:57
2096,AusSkincare,I need help with blackheads,I'm currently using the hydro boost as my clea...,41,44,https://i.redd.it/0z85d7qo9jv91.jpg,,rough,2022-10-23 06:33:45
2097,AusSkincare,Meccas Sunscreen serum ☀️,Has anyone tried this yet? It looks so interes...,42,9,https://i.redd.it/yp6u27kfaqj91.jpg,,sunscreen,2022-08-24 22:20:46
2098,AusSkincare,We can now drop off all brands of empty beauty...,,40,2,https://i.redd.it/wfnxr6c4tdg91.png,,aging,2022-08-08 01:11:11
2099,AusSkincare,Your favourite SPF50+ long lasting sunscreen,I’ve just got a job as a Traffic Controller wh...,42,21,https://www.reddit.com/r/AusSkincare/comments/...,,sunscreen,2021-11-30 09:30:33
2100,AusSkincare,Whats the point of having actives in cleansers...,Unpopular opinion i know. I'm not new to skinc...,42,15,https://www.reddit.com/r/AusSkincare/comments/...,vitamin c,rough,2021-09-08 10:29:31
2101,AusSkincare,Hey what’s everyone’s experience with these tw...,,42,72,https://www.reddit.com/gallery/jsqrds,,sunscreen,2020-11-12 07:54:16
2102,AusSkincare,"Priceline 3 Day Sale (40% off Skincare, Suncar...",Priceline is doing another big 3 day sale! \n\...,41,57,https://www.reddit.com/r/AusSkincare/comments/...,jojoba,,2020-08-11 10:48:35
2103,AusSkincare,PSA: Moo Goo has launched a 1% bakuchiol serum!,"So I was on Moo Goo's website, clicking around...",41,21,https://www.reddit.com/r/AusSkincare/comments/...,bakuchiol,,2020-06-21 04:42:50
2104,AusSkincare,Best Of/ Holy Grail Products: EXFOLIANTS,Hi there and welcome to the Best Of/ Holy Grai...,44,19,https://www.reddit.com/r/AusSkincare/comments/...,,rough,2019-10-15 01:02:53


#### Step 1.7 Export into CSV

In [34]:
df.to_csv('OX24006_SDPA_data.csv') #export to csv
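By default, `to_csv` writes the row index as an extra, unnamed first column, which appears as `Unnamed: 0` when the file is read back with `read_csv`. A small sketch of the difference that `index=False` makes, using an in-memory buffer and made-up rows instead of the real file:

```python
import io
import pandas as pd

sample = pd.DataFrame({
    'subreddit': ['AusSkincare', 'AsianBeauty'],
    'upvotes': [42, 41],
})

# Default behaviour: the index is written as an unnamed first column
with_index = io.StringIO()
sample.to_csv(with_index)
with_index.seek(0)
reloaded_with_index = pd.read_csv(with_index)

# index=False leaves the index out of the file
without_index = io.StringIO()
sample.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_without_index = pd.read_csv(without_index)

print(list(reloaded_with_index.columns))
print(list(reloaded_without_index.columns))
```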

#### Step 1.8 Describing the Data Set

This data set comes from 6 different skincare subreddits: the most popular skincare subreddit (r/SkincareAddiction), the 2 next most popular skincare subreddits, which are targeted at more niche audiences (r/AsianBeauty and r/30PlusSkinCare), and 3 country-specific subreddits which are still among the larger ones but relatively less popular.

Data was scraped using PRAW, the Python Reddit API Wrapper. This package gives Python access to the Reddit API and was used here to retrieve the top posts of each subreddit together with their attributes. The initial list of variables for consideration was based on PRAW guides, which cite title, body (self-text), URL, upvotes and comments.

Hence, the variables of interest in this analysis are the columns of the DataFrame above: `subreddit`, `title`, `body`, `upvotes`, `num_comments`, `url`, `ingredient`, `concern` and `date`.

### Step 2: Data Preparation and Cleaning 
The next step in this analysis is to clean and prepare the data:

1. Remove duplicates.
2. Handle missing data: observe where it comes from, then remove rows with missing data.
3. Handle any outliers or inconsistencies in the data, if any.
4. Perform any additional steps to enrich the data (parsing dates, creating additional columns/features, etc.).
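The `date` column is stored as text (dtype `object` in the `df.info()` output above), so the date-parsing step could look roughly like this. It is a sketch on two date strings taken from the preview rows; the extra `year` and `month` columns are illustrative additions, not part of the original data set.

```python
import pandas as pd

sample = pd.DataFrame({'date': ['2020-05-16 22:50:03', '2022-12-02 17:36:48']})

# Parse the text into real datetime values using the format written by get_date
sample['date'] = pd.to_datetime(sample['date'], format='%Y-%m-%d %H:%M:%S')

# Derive extra features from the parsed dates
sample['year'] = sample['date'].dt.year
sample['month'] = sample['date'].dt.month

print(sample.dtypes)
print(sample)
```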

In [40]:
#make a proper copy, so that changes to the working df do not affect the original df
wdf = df.copy() #working df

#### Step 2.1 Handling duplicates

Using drop duplicates to remove any duplicate values.

In [43]:
wdf = wdf.drop_duplicates() #drop_duplicates returns a new DataFrame, so assign it back
wdf #display the result

Unnamed: 0,subreddit,title,body,upvotes,num_comments,url,ingredient,concern,date
0,SkincareAddiction,Posted here over a month ago showing how [acne...,,17351,257,https://i.redd.it/dtm3c3p277z41.jpg,,acne,2020-05-16 22:50:03
1,SkincareAddiction,[Anti-Aging] I may have used too much retinol ...,,15803,115,https://i.redd.it/r8g7c71mti3a1.jpg,retinol,sunscreen,2022-12-02 17:36:48
2,SkincareAddiction,[Selfie] 2 year transformation and glow up. Cy...,,11643,287,https://i.redd.it/8l4z6jzyogb51.jpg,,acne,2020-07-17 19:36:02
3,SkincareAddiction,[Personal] My Mother at 53 years old. She's th...,,11231,272,http://i.imgur.com/Ph4JiDD.jpg,,sunscreen,2016-10-16 22:19:03
4,SkincareAddiction,"[B&A] [Selfie] 3 microneedling sessions, 1 las...",,11150,340,https://i.redd.it/ruermpk6cwg31.jpg,hyaluronic,,2019-08-17 00:35:56
...,...,...,...,...,...,...,...,...,...
2100,AusSkincare,Whats the point of having actives in cleansers...,Unpopular opinion i know. I'm not new to skinc...,42,15,https://www.reddit.com/r/AusSkincare/comments/...,vitamin c,rough,2021-09-08 10:29:31
2101,AusSkincare,Hey what’s everyone’s experience with these tw...,,42,72,https://www.reddit.com/gallery/jsqrds,,sunscreen,2020-11-12 07:54:16
2102,AusSkincare,"Priceline 3 Day Sale (40% off Skincare, Suncar...",Priceline is doing another big 3 day sale! \n\...,41,57,https://www.reddit.com/r/AusSkincare/comments/...,jojoba,,2020-08-11 10:48:35
2103,AusSkincare,PSA: Moo Goo has launched a 1% bakuchiol serum!,"So I was on Moo Goo's website, clicking around...",41,21,https://www.reddit.com/r/AusSkincare/comments/...,bakuchiol,,2020-06-21 04:42:50


The returned DataFrame still has 2105 rows (started at 2105, still 2105), so there are no duplicate rows in the dataset.
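Keep in mind that `drop_duplicates` returns a new DataFrame rather than changing the one it is called on, so the result has to be assigned if it is to be kept. A small sketch with made-up rows, one of which is an exact duplicate:

```python
import pandas as pd

sample = pd.DataFrame({
    'title': ['Post A', 'Post A', 'Post B'],
    'url': ['https://example.com/a', 'https://example.com/a', 'https://example.com/b'],
})

# Calling the method without assignment leaves the original untouched
sample.drop_duplicates()
rows_without_assignment = len(sample)

# Assigning the result keeps the de-duplicated rows
deduped = sample.drop_duplicates().reset_index(drop=True)

print(rows_without_assignment, len(deduped))
```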

#### Step 2.2 Handling missing data
##### Item 1: 
A glance at the dataset shows many Reddit submissions with no body text. These rows will need to be removed.

##### Item 2: 
In addition, there are two instances of values that are missing on purpose. The scraping rules set wanted to find posts related to either specific skincare ingredients or specific skin care concerns. Not all posts would have both. 

##### Item 3: 
Before removal of rows with missing values, the necessary null values will be filled. 

##### Item 4:
In addition, I also want to examine the posts that mention both an ingredient and a concern. Hence, another copy will be made, and all rows containing null values will be removed.
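Item 3 can be sketched as follows, assuming the missing `ingredient` and `concern` values are filled with a placeholder label. The label `'not mentioned'` and the sample rows are illustrative and are not taken from the original analysis.

```python
import pandas as pd

sample = pd.DataFrame({
    'title': ['Post A', 'Post B', 'Post C'],
    'ingredient': ['retinol', None, None],
    'concern': [None, 'acne', 'dry'],
})

# Fill only the two columns that are allowed to be empty by design
filled = sample.copy()
filled[['ingredient', 'concern']] = filled[['ingredient', 'concern']].fillna('not mentioned')

print(filled)
```

Filling these two columns first means a later `dropna` no longer discards posts that mention only an ingredient or only a concern.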

In [82]:
#Item 4
df_both = df.copy() #create copy

In [84]:
df_both.dropna() #preview the rows left after dropping any row with a null value (result is not stored)

Unnamed: 0,subreddit,title,body,upvotes,num_comments,url,ingredient,concern,date
1,SkincareAddiction,[Anti-Aging] I may have used too much retinol ...,,15803,115,https://i.redd.it/r8g7c71mti3a1.jpg,retinol,sunscreen,2022-12-02 17:36:48
25,SkincareAddiction,"[Misc] Some of you need a therapist, not a der...",Some of the posts I see on here are incredibly...,7495,435,https://www.reddit.com/r/SkincareAddiction/com...,retinol,aging,2022-06-16 15:25:32
40,SkincareAddiction,[PSA] If someone is happy with their skincare ...,I think many of us have been in this situation...,6432,482,https://www.reddit.com/r/SkincareAddiction/com...,hyaluronic,sunscreen,2019-10-02 00:39:13
56,SkincareAddiction,[B&A] A lurkers 14-15 months of hard work and ...,"As an avid lurker, I have finally decided to p...",5668,241,https://www.reddit.com/gallery/z4wy65,glycolic,sunscreen,2022-11-26 03:36:58
69,SkincareAddiction,"[Humor] Me wait for the Niacinamide, Retinol a...",,5307,126,https://i.imgur.com/nRKOC3M.jpg,niacinamide,acne,2019-11-15 06:29:03
...,...,...,...,...,...,...,...,...,...
2075,AusSkincare,Accutane saved me.,"https://i.imgur.com/ulESxbZ.jpg \n\n27 now, I’...",46,31,https://www.reddit.com/r/AusSkincare/comments/...,lactic,sunscreen,2020-07-08 13:38:09
2079,AusSkincare,AIRYDAY SPF Review: Mineral Mousse & Pretty in...,Any questions? Please do ask! And share how yo...,46,37,https://www.reddit.com/gallery/13f6t5h,vitamin c,dry,2023-05-12 02:39:18
2081,AusSkincare,Reviews of popular sunscreens I’ve tried so far,"Been a bit of a sunscreen junkie lately, so th...",44,24,https://www.reddit.com/r/AusSkincare/comments/...,vitamin c,sunscreen,2021-08-23 11:01:07
2094,AusSkincare,Priceline’s free skincare gift bag promotion s...,What I purchased to get the gift bag: \n\nCer...,44,28,https://i.redd.it/9tndg6xsyu1c1.jpg,niacinamide,hyperpigmentation,2023-11-22 08:16:33


The row count is now 460. However, the preview shows that there are still empty cells in 'body', even though the earlier `df.info()` output reported no null values in that column. This indicates that the empty bodies are probably stored as empty strings rather than nulls.

The next step is therefore to replace empty strings with nulls and remove null values again.
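The difference between a null and an empty string can be checked directly. This is a sketch on a made-up `body` column containing one of each:

```python
import pandas as pd

sample = pd.DataFrame({'body': ['', 'Some text', None]})

# isna() only sees the real null, not the empty string
null_count = int(sample['body'].isna().sum())

# Comparing against '' counts the empty strings that isna() misses
empty_count = int((sample['body'] == '').sum())

# Keep only rows whose body is neither null nor empty
with_body = sample[sample['body'].notna() & (sample['body'] != '')]

print(null_count, empty_count, len(with_body))
```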

In [88]:
#replace empty strings ("") with NA, the null value
df_both.replace("", pd.NA, inplace=True)
#remove empty
df_both.dropna(inplace=True)
df_both #print to check

Unnamed: 0,subreddit,title,body,upvotes,num_comments,url,ingredient,concern,date
25,SkincareAddiction,"[Misc] Some of you need a therapist, not a der...",Some of the posts I see on here are incredibly...,7495,435,https://www.reddit.com/r/SkincareAddiction/com...,retinol,aging,2022-06-16 15:25:32
40,SkincareAddiction,[PSA] If someone is happy with their skincare ...,I think many of us have been in this situation...,6432,482,https://www.reddit.com/r/SkincareAddiction/com...,hyaluronic,sunscreen,2019-10-02 00:39:13
56,SkincareAddiction,[B&A] A lurkers 14-15 months of hard work and ...,"As an avid lurker, I have finally decided to p...",5668,241,https://www.reddit.com/gallery/z4wy65,glycolic,sunscreen,2022-11-26 03:36:58
80,SkincareAddiction,[Before&After] 6 months of retinol,Just wanted to share some progress on my skinc...,4957,242,https://i.redd.it/xrn26lu70ak91.jpg,bakuchiol,wrinkle,2022-08-27 16:39:06
83,SkincareAddiction,[Personal] Aren't most 'shelfies' are just glo...,I love reading this sub but I really think all...,4938,321,https://www.reddit.com/r/SkincareAddiction/com...,retinol,rough,2018-05-09 02:36:54
...,...,...,...,...,...,...,...,...,...
2075,AusSkincare,Accutane saved me.,"https://i.imgur.com/ulESxbZ.jpg \n\n27 now, I’...",46,31,https://www.reddit.com/r/AusSkincare/comments/...,lactic,sunscreen,2020-07-08 13:38:09
2079,AusSkincare,AIRYDAY SPF Review: Mineral Mousse & Pretty in...,Any questions? Please do ask! And share how yo...,46,37,https://www.reddit.com/gallery/13f6t5h,vitamin c,dry,2023-05-12 02:39:18
2081,AusSkincare,Reviews of popular sunscreens I’ve tried so far,"Been a bit of a sunscreen junkie lately, so th...",44,24,https://www.reddit.com/r/AusSkincare/comments/...,vitamin c,sunscreen,2021-08-23 11:01:07
2094,AusSkincare,Priceline’s free skincare gift bag promotion s...,What I purchased to get the gift bag: \n\nCer...,44,28,https://i.redd.it/9tndg6xsyu1c1.jpg,niacinamide,hyperpigmentation,2023-11-22 08:16:33


In [90]:
df_both.info() #check info 

<class 'pandas.core.frame.DataFrame'>
Index: 446 entries, 25 to 2100
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   subreddit     446 non-null    object
 1   title         446 non-null    object
 2   body          446 non-null    object
 3   upvotes       446 non-null    int64 
 4   num_comments  446 non-null    int64 
 5   url           446 non-null    object
 6   ingredient    446 non-null    object
 7   concern       446 non-null    object
 8   date          446 non-null    object
dtypes: int64(2), object(7)
memory usage: 34.8+ KB


`df_both` is now a table of cleaned data containing only posts which mention both an ingredient and a concern and which have a non-empty body.
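One of the tables planned at the start of this notebook counts how often each ingredient is referenced in each subreddit. A minimal sketch of that count uses `pd.crosstab` on a small made-up sample rather than the scraped data:

```python
import pandas as pd

sample = pd.DataFrame({
    'subreddit': ['SkincareAddiction', 'SkincareAddiction', 'AusSkincare', 'AusSkincare'],
    'ingredient': ['retinol', 'retinol', 'vitamin c', 'niacinamide'],
})

# Rows are subreddits, columns are ingredients, cells are post counts
counts = pd.crosstab(sample['subreddit'], sample['ingredient'])

print(counts)
```

Because the scraping loop stores only one ingredient per post, a table built this way from the real data would count posts by their recorded ingredient, not every ingredient mentioned.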