# 02_data_preparation notebook

**The objective of this notebook is to read, explore, and clean the data for both subreddit DataFrames.**

## 1. Imports

In [46]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import datetime

## 2. Read CSVs

In [47]:
millenials_df = pd.read_csv('../data/millenials_raw.csv')
genz_df = pd.read_csv('../data/genz_raw.csv')

## 3. Data Cleaning

In [54]:
millenials_df.head()

Unnamed: 0,id,created_utc,title,self_text,num_comments,num_upvotes,upvote_ratio,subreddit
0,17s1aa5,1699613000.0,Do you feel dissillusioned with social media?,It's not difficult to argue that the user expe...,132,112,0.98,millenials
1,1ccrau6,1714050000.0,Yesterday I noticed a Lamborghini beside me in...,…was a time…,190,439,0.93,millenials
2,1ccwkzr,1714063000.0,Going through a midlife crisis,I have been realizing recently that I am going...,47,25,0.9,millenials
3,1cbwkmb,1713961000.0,It's funny how get a degree in anything has tu...,Had an interesting thought this morning. Obvio...,2026,4972,0.86,millenials
4,1ccjb7c,1714020000.0,Does anyone else's parents get angry when you ...,"For example, I have been separated from my son...",29,73,0.97,millenials


In [55]:
genz_df.head()

Unnamed: 0,id,created_utc,title,self_text,num_comments,num_upvotes,upvote_ratio,subreddit
0,1cco3ai,1714039000.0,What movies/TV shows have you been watching th...,"Animated, live-action, anime, etc.\n\nPlease m...",8,8,1.0,GenZ
1,1ccp0cg,1714043000.0,"So guys, whats your position on the roundabout?","I am a big fan of the roundabout, albeit, they...",1507,2083,0.86,GenZ
2,1ccyjg2,1714070000.0,Self love is not buying yourself nice things a...,Self-love is delaying gratification with exerc...,213,275,0.68,GenZ
3,1ccpw52,1714046000.0,Pressure when you turn 25-30,I feel a lot of people around our age have thi...,238,507,0.96,GenZ
4,1ccup50,1714059000.0,What is everyone's favourite dinosaur?,,239,246,0.97,GenZ


In [56]:
millenials_df.shape

(958, 8)

In [57]:
genz_df.shape

(917, 8)

### convert UTC time

In [58]:
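# Convert a 'created_utc' Unix timestamp (seconds) to a readable UTC date-time string
def convert_utctime(i):
    # datetime.utcfromtimestamp() is deprecated since Python 3.12; the timezone-aware call gives the same string
    return datetime.datetime.fromtimestamp(i, tz=datetime.timezone.utc).strftime('%Y-%m-%d %H:%M:%S')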
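
As a quick sanity check, the sketch below runs a single epoch value through a timezone-aware variant of the conversion. The helper name `epoch_to_utc_string` is hypothetical, and the sample value is the rounded number displayed in the table above, so it is only an approximation of the stored timestamp.

```python
import datetime

def epoch_to_utc_string(seconds):
    """Format a Unix timestamp (seconds since 1970-01-01 UTC) as a UTC string."""
    moment = datetime.datetime.fromtimestamp(seconds, tz=datetime.timezone.utc)
    return moment.strftime('%Y-%m-%d %H:%M:%S')

# Rounded value as displayed by head(); used here only for illustration.
sample = 1699613000.0
converted = epoch_to_utc_string(sample)
print(converted)
```

The result lands a few minutes away from the value shown later in the notebook, which is what one would expect if the displayed float was rounded.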

'created_utc' is currently stored as an object column of floats; use pd.to_numeric to convert it to a float type (errors='coerce' turns any unparseable value into NaN).
Then create a new column 'converted_utc' to store the converted data.

In [59]:
millenials_df['converted_utc'] = pd.to_numeric(millenials_df['created_utc'], errors='coerce')
genz_df['converted_utc'] = pd.to_numeric(genz_df['created_utc'], errors='coerce')
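
Because errors='coerce' silently replaces bad values with NaN, it can be worth counting them before converting. The sketch below uses a toy Series (not the real data) with one deliberately malformed entry; the variable names are illustrative assumptions.

```python
import pandas as pd

# Toy values standing in for 'created_utc'; the last one is deliberately malformed.
raw = pd.Series(['1714050000.0', 1714063000.0, 'not_a_timestamp'])

# Unparseable entries become NaN instead of raising an error.
numeric = pd.to_numeric(raw, errors='coerce')

# Count how many values failed to parse.
n_failed = int(numeric.isna().sum())
print(numeric.dtype, n_failed)
```

If the count is greater than zero on the real columns, those rows would probably need to be dropped or repaired before the timestamp conversion is applied.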

Apply the 'convert_utctime' function to both DataFrames and save the converted UTC strings back to the 'created_utc' column.

In [60]:
millenials_df['created_utc'] = millenials_df['converted_utc'].apply(convert_utctime)
genz_df['created_utc'] = genz_df['converted_utc'].apply(convert_utctime)
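
A possible alternative to the row-by-row apply is to let pandas convert the whole column at once with pd.to_datetime and unit='s'. This is a sketch on two rounded sample values, not the notebook's actual method; the variable names are assumptions.

```python
import pandas as pd

# Rounded sample epochs (seconds), as displayed by head().
epochs = pd.Series([1699613000.0, 1714050000.0])

# unit='s' tells pandas the numbers are seconds since the Unix epoch (UTC).
as_datetime = pd.to_datetime(epochs, unit='s')

# Optional: render the same string format the notebook uses.
as_text = as_datetime.dt.strftime('%Y-%m-%d %H:%M:%S')
print(as_text.tolist())
```

This route yields a datetime column directly, so the later pd.to_datetime step would not be needed if it were adopted.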

Drop the temporary 'converted_utc' column from both DataFrames.

In [61]:
millenials_df.drop(['converted_utc'], axis=1, inplace = True)

In [62]:
genz_df.drop(['converted_utc'], axis=1, inplace = True)

### check for nulls

In [63]:
millenials_df.isnull().sum()

id                0
created_utc       0
title             0
self_text       255
num_comments      0
num_upvotes       0
upvote_ratio      0
subreddit         0
dtype: int64

In [64]:
genz_df.isnull().sum()

id                0
created_utc       0
title             0
self_text       267
num_comments      0
num_upvotes       0
upvote_ratio      0
subreddit         0
dtype: int64


**'self_text' is the submission's selftext; it is an empty string if the post is only a link, and those empty values show up as nulls here.
255 of the 958 posts in r/millenials and 267 of the 917 posts in r/GenZ have no 'self_text'.
Fill the empty 'self_text' rows with the string "no_text".
This shows that many of the top 1000 posts of all time in both subreddits have no selftext, and 'self_text' is not the data that I want to use for the classification models later.**
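
To put those counts in proportion, the following sketch computes the share of posts without selftext in each subreddit. The numbers are copied from the isnull() and shape outputs above; the dictionary layout is just an illustrative choice.

```python
# (null count, row count) per subreddit, taken from the outputs above.
missing = {'millenials': (255, 958), 'GenZ': (267, 917)}

# Fraction of posts with no self_text in each subreddit.
shares = {name: n_missing / n_rows for name, (n_missing, n_rows) in missing.items()}

for name, share in shares.items():
    print(f"{name}: {share:.1%} of posts have no self_text")
```

Roughly a quarter to a third of the posts in each subreddit carry no body text.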

In [65]:
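# Assign back instead of using inplace=True on a selected column, which newer pandas flags as chained assignment
millenials_df['self_text'] = millenials_df['self_text'].fillna('no_text')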

In [66]:
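# Assign back instead of using inplace=True on a selected column, which newer pandas flags as chained assignment
genz_df['self_text'] = genz_df['self_text'].fillna('no_text')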

## 3. Create New Features

#### Title Length

Create a new column that stores the number of words in each post title, in both subreddit DataFrames.

In [67]:
# Word count of post titles
millenials_df['title_length'] = millenials_df['title'].apply(lambda title: len(str(title).split()))
genz_df['title_length'] = genz_df['title'].apply(lambda title: len(str(title).split()))
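
The same word count can also be written with pandas string methods instead of a lambda. The sketch below uses two titles from the tables in this notebook; it is an alternative formulation, not the code that produced the results shown.

```python
import pandas as pd

titles = pd.Series(['Going through a midlife crisis', 'Want to move out?'])

# split() with no arguments splits on any whitespace; str.len() counts the pieces.
word_counts = titles.str.split().str.len()
print(word_counts.tolist())
```

One difference to keep in mind: a missing title stays missing with this form, whereas str(title) in the lambda turns it into the one-word string 'nan'. Neither DataFrame has null titles, so the two approaches should agree here.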

#### Day of the Week

Create a new column to store the day of the week on which each post was created, in both subreddit DataFrames.
Use pd.to_datetime to convert 'created_utc' to a datetime column, then use dt.day_name() to get the day of the week.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.dt.day_name.html

In [68]:
millenials_df['created_utc'] = pd.to_datetime(millenials_df['created_utc'])
genz_df['created_utc'] = pd.to_datetime(genz_df['created_utc'])

In [69]:
millenials_df['day_of_week'] = millenials_df['created_utc'].dt.day_name()
genz_df['day_of_week'] = genz_df['created_utc'].dt.day_name()
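
A minimal illustration of the two steps, using two timestamps that appear in the tables of this notebook (the variable names are assumptions):

```python
import pandas as pd

# Strings in the same format that 'created_utc' holds after the earlier conversion.
stamps = pd.to_datetime(pd.Series(['2023-11-10 10:37:15', '2024-04-25 12:59:13']))

# The .dt accessor exposes datetime methods on every element of the Series.
days = stamps.dt.day_name()
print(days.tolist())
```

Since the timestamps are in UTC, the day name reflects the UTC calendar day, which may differ from the poster's local day near midnight.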

In [70]:
millenials_df.head()

Unnamed: 0,id,created_utc,title,self_text,num_comments,num_upvotes,upvote_ratio,subreddit,title_length,day_of_week
0,17s1aa5,2023-11-10 10:37:15,Do you feel dissillusioned with social media?,It's not difficult to argue that the user expe...,132,112,0.98,millenials,7,Friday
1,1ccrau6,2024-04-25 12:59:13,Yesterday I noticed a Lamborghini beside me in...,…was a time…,190,439,0.93,millenials,22,Thursday
2,1ccwkzr,2024-04-25 16:36:49,Going through a midlife crisis,I have been realizing recently that I am going...,47,25,0.9,millenials,5,Thursday
3,1cbwkmb,2024-04-24 12:12:05,It's funny how get a degree in anything has tu...,Had an interesting thought this morning. Obvio...,2026,4972,0.86,millenials,17,Wednesday
4,1ccjb7c,2024-04-25 04:44:28,Does anyone else's parents get angry when you ...,"For example, I have been separated from my son...",29,73,0.97,millenials,15,Thursday


#### Segment of the Day (morning, afternoon, evening, night)

In [71]:
# Map an hour (0-23, in UTC) to the segment of the day in which the post was created
def time_of_day(hour):
    if 5 <= hour < 12:
        return 'Morning'
    elif 12 <= hour < 17:
        return 'Afternoon'
    elif 17 <= hour < 21:
        return 'Evening'
    else:
        return 'Night'

In [72]:
millenials_df['segment_of_day'] = millenials_df['created_utc'].dt.hour.apply(time_of_day)
genz_df['segment_of_day'] = genz_df['created_utc'].dt.hour.apply(time_of_day)
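
The segment boundaries are half-open intervals, so each hour falls into exactly one segment. The sketch below restates the function so that it runs on its own and checks the hours on either side of every boundary; the `boundaries` dictionary is only an illustrative helper.

```python
def time_of_day(hour):
    """Return the segment of the day for an hour in the range 0-23."""
    if 5 <= hour < 12:
        return 'Morning'
    elif 12 <= hour < 17:
        return 'Afternoon'
    elif 17 <= hour < 21:
        return 'Evening'
    else:
        return 'Night'

# Hours just before and at each boundary.
boundaries = {h: time_of_day(h) for h in (0, 4, 5, 11, 12, 16, 17, 20, 21, 23)}
for hour, segment in boundaries.items():
    print(f"{hour:02d}:00 -> {segment}")
```

Note that the hour comes from 'created_utc', so a segment such as 'Morning' describes UTC time rather than the poster's local time.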

In [73]:
millenials_df.head()

Unnamed: 0,id,created_utc,title,self_text,num_comments,num_upvotes,upvote_ratio,subreddit,title_length,day_of_week,segment_of_day
0,17s1aa5,2023-11-10 10:37:15,Do you feel dissillusioned with social media?,It's not difficult to argue that the user expe...,132,112,0.98,millenials,7,Friday,Morning
1,1ccrau6,2024-04-25 12:59:13,Yesterday I noticed a Lamborghini beside me in...,…was a time…,190,439,0.93,millenials,22,Thursday,Afternoon
2,1ccwkzr,2024-04-25 16:36:49,Going through a midlife crisis,I have been realizing recently that I am going...,47,25,0.9,millenials,5,Thursday,Afternoon
3,1cbwkmb,2024-04-24 12:12:05,It's funny how get a degree in anything has tu...,Had an interesting thought this morning. Obvio...,2026,4972,0.86,millenials,17,Wednesday,Afternoon
4,1ccjb7c,2024-04-25 04:44:28,Does anyone else's parents get angry when you ...,"For example, I have been separated from my son...",29,73,0.97,millenials,15,Thursday,Night


In [74]:
genz_df.head()

Unnamed: 0,id,created_utc,title,self_text,num_comments,num_upvotes,upvote_ratio,subreddit,title_length,day_of_week,segment_of_day
0,1cco3ai,2024-04-25 10:00:31,What movies/TV shows have you been watching th...,"Animated, live-action, anime, etc.\n\nPlease m...",8,8,1.0,GenZ,9,Thursday,Morning
1,1ccp0cg,2024-04-25 10:58:03,"So guys, whats your position on the roundabout?","I am a big fan of the roundabout, albeit, they...",1507,2083,0.86,GenZ,8,Thursday,Morning
2,1ccyjg2,2024-04-25 18:26:04,Self love is not buying yourself nice things a...,Self-love is delaying gratification with exerc...,213,275,0.68,GenZ,11,Thursday,Evening
3,1ccpw52,2024-04-25 11:47:44,Pressure when you turn 25-30,I feel a lot of people around our age have thi...,238,507,0.96,GenZ,5,Thursday,Morning
4,1ccup50,2024-04-25 15:22:31,What is everyone's favourite dinosaur?,no_text,239,246,0.97,GenZ,5,Thursday,Afternoon


### combine 2 subreddits into one dataframe

In [75]:
full_df = pd.concat([millenials_df, genz_df], axis = 0)
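
Stacking with axis=0 keeps each DataFrame's original row labels, so the combined index contains repeated values. The toy example below shows this, along with the ignore_index=True option as one possible way to renumber the rows; it is a sketch on made-up frames, not a change to the notebook's pipeline.

```python
import pandas as pd

# Tiny stand-ins for the two subreddit DataFrames.
left = pd.DataFrame({'subreddit': ['millenials'] * 3})
right = pd.DataFrame({'subreddit': ['GenZ'] * 2})

# Default behaviour: original labels are kept, so 0 and 1 appear twice.
stacked = pd.concat([left, right], axis=0)

# ignore_index=True discards the old labels and numbers the rows 0..n-1.
renumbered = pd.concat([left, right], axis=0, ignore_index=True)

print(stacked.index.tolist())
print(renumbered.index.tolist())
```

Repeated labels are harmless for the CSV export with index=False, but they matter if rows are later selected by label with .loc.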

Inspect the data type of each column.

In [76]:
full_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1875 entries, 0 to 916
Data columns (total 11 columns):
 #   Column          Non-Null Count  Dtype         
---  ------          --------------  -----         
 0   id              1875 non-null   object        
 1   created_utc     1875 non-null   datetime64[ns]
 2   title           1875 non-null   object        
 3   self_text       1875 non-null   object        
 4   num_comments    1875 non-null   int64         
 5   num_upvotes     1875 non-null   int64         
 6   upvote_ratio    1875 non-null   float64       
 7   subreddit       1875 non-null   object        
 8   title_length    1875 non-null   int64         
 9   day_of_week     1875 non-null   object        
 10  segment_of_day  1875 non-null   object        
dtypes: datetime64[ns](1), float64(1), int64(3), object(6)
memory usage: 175.8+ KB


## 4. Export full DataFrame to csv

In [79]:
full_df.to_csv('../data/full_df.csv', index = False)

In [80]:
millenials_df.to_csv('../data/millenials_clean.csv', index = False)
genz_df.to_csv('../data/genz_clean.csv', index = False)
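
CSV files do not store dtypes, so 'created_utc' comes back as plain text when the file is read again unless it is parsed explicitly. The round trip below uses an in-memory buffer and a one-row toy frame to illustrate this; the later notebooks may or may not pass parse_dates, so treat it as an assumption about how the files could be loaded.

```python
import io
import pandas as pd

# One-row stand-in for the cleaned DataFrame.
frame = pd.DataFrame({
    'id': ['17s1aa5'],
    'created_utc': pd.to_datetime(['2023-11-10 10:37:15']),
})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
frame.to_csv(buffer, index=False)

# Plain read: the datetime column is loaded as text.
buffer.seek(0)
plain = pd.read_csv(buffer)

# parse_dates restores the datetime dtype for the named column.
buffer.seek(0)
parsed = pd.read_csv(buffer, parse_dates=['created_utc'])

print(plain['created_utc'].dtype, parsed['created_utc'].dtype)
```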

Unnamed: 0,id,created_utc,title,self_text,num_comments,num_upvotes,upvote_ratio,subreddit,title_length,day_of_week,segment_of_day
0,17s1aa5,2023-11-10 10:37:15,Do you feel dissillusioned with social media?,It's not difficult to argue that the user expe...,132,112,0.98,millenials,7,Friday,Morning
1,1ccrau6,2024-04-25 12:59:13,Yesterday I noticed a Lamborghini beside me in...,…was a time…,190,439,0.93,millenials,22,Thursday,Afternoon
2,1ccwkzr,2024-04-25 16:36:49,Going through a midlife crisis,I have been realizing recently that I am going...,47,25,0.90,millenials,5,Thursday,Afternoon
3,1cbwkmb,2024-04-24 12:12:05,It's funny how get a degree in anything has tu...,Had an interesting thought this morning. Obvio...,2026,4972,0.86,millenials,17,Wednesday,Afternoon
4,1ccjb7c,2024-04-25 04:44:28,Does anyone else's parents get angry when you ...,"For example, I have been separated from my son...",29,73,0.97,millenials,15,Thursday,Night
...,...,...,...,...,...,...,...,...,...,...,...
912,1c2uxjx,2024-04-13 06:01:46,Upside down by Jack Johnson,Who else gets hit by nostalgia whenever they h...,2,1,0.67,GenZ,5,Saturday,Morning
913,1c2app9,2024-04-12 14:34:39,How to become a boomer,I want to feel like a boomer,22,16,0.75,GenZ,5,Friday,Afternoon
914,1c2q8ql,2024-04-13 01:36:20,Want to move out?,Hi. I made a discord server for resources on a...,1,3,0.71,GenZ,4,Saturday,Night
915,1bzn4ch,2024-04-09 08:35:05,How do us GenZ’s feel about this?,no_text,1741,33095,0.94,GenZ,7,Tuesday,Morning
