# Project 3 - Concatenating Subreddits and Preliminary Cleaning

In this portion of the project, we concatenate the scraped data from both subreddits and trim it down to the more active posts, meaning those with more than four comments. We also create two new columns for further analysis down the line.

In [1]:
# Data Manipulation
import numpy as np
import pandas as pd

In [2]:
# this setting widens how many characters pandas will display in a column:
pd.options.display.max_colwidth = 350

In [3]:
pr_df = pd.read_csv('data/0_Scraped_Data/procreate_final.csv')

In [4]:
ai_df = pd.read_csv('data/0_Scraped_Data/adobeillustrator.csv')

In [5]:
pr_df.head(1)

Unnamed: 0,created_utc,full_link,num_comments,score,selftext,subreddit,subreddit_subscribers,title
0,1546306570,https://www.reddit.com/r/ProCreate/comments/abdonl/which_ipad_pro_should_i_get_to_use_procreate/,2,1,"Hi!\nAre there very noticeable differences between the new ipad pro and the new apple pencil and the previous version of both? Other than the awkward charging of the pencil, I guess i can live with that to save some money. \n\nI will mostly use it for illustrating for fun on procreate. \n\nThanks!",ProCreate,4223,Which ipad pro should i get to use procreate?


In [6]:
ai_df.head(1)

Unnamed: 0,created_utc,full_link,num_comments,score,selftext,subreddit,subreddit_subscribers,title
0,1546306849,https://www.reddit.com/r/AdobeIllustrator/comments/abdpyu/how_to_when_using_the_blend_tool_i_try_to_blend/,0,1,Extra info: the specified strokes in the middle aren't following the style of the main strokes.,AdobeIllustrator,25285,"HOW TO: When using the blend tool, I try to blend two strokes with a style applied to both ( like the different width profiles ) and the blended strokes aren't affected. How do I fix this?"


In [7]:
# Trim rows down: keep posts with more than 4 comments whose body is not '[deleted]'

In [8]:
pr_df = pr_df[(pr_df['num_comments'] > 4) & (pr_df['selftext'] != '[deleted]')]

In [9]:
ai_df = ai_df[(ai_df['num_comments'] > 4) & (ai_df['selftext'] != '[deleted]')]
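
The filter above combines two boolean masks with `&`. A minimal sketch on made-up data (the `toy_df` frame and its values are hypothetical, not taken from the scrape) shows one consequence worth knowing: a missing `selftext` (NaN) compares as "not equal" to `'[deleted]'`, so link or image posts without body text are kept.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the scraped data (values are invented for illustration).
toy_df = pd.DataFrame({
    'title': ['a', 'b', 'c', 'd'],
    'num_comments': [7, 2, 9, 5],
    'selftext': [np.nan, 'some text', '[deleted]', 'more text'],
})

# Same shape of mask as in the cells above:
# more than 4 comments AND body text not marked as deleted.
mask = (toy_df['num_comments'] > 4) & (toy_df['selftext'] != '[deleted]')
kept = toy_df[mask]

print(kept['title'].tolist())  # ['a', 'd']
```

Row `a` survives even though its `selftext` is NaN, row `b` fails the comment threshold, and row `c` is dropped as deleted.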

In [10]:
project_ai_df = ai_df[['subreddit', 'title']].copy()  # keep only the two columns needed for modelling

In [11]:
project_pr_df = pr_df[['subreddit', 'title']].copy()  # keep only the two columns needed for modelling

In [12]:
project_ai_df.head(1)

Unnamed: 0,subreddit,title
1,AdobeIllustrator,"Wanna start using illustrator to draw, mouse vs drawing tablet?"


In [13]:
project_pr_df.head(1)

Unnamed: 0,subreddit,title
1,ProCreate,Giving this a go!


In [14]:
frames = [project_pr_df, project_ai_df]
ai_pr_df = pd.concat(frames, axis = 0) # Stack the DataFrames vertically (rows of one below the other)

In [15]:
ai_pr_df.head(1)

Unnamed: 0,subreddit,title
1,ProCreate,Giving this a go!


In [16]:
ai_pr_df.tail(1)

Unnamed: 0,subreddit,title
20595,AdobeIllustrator,Jaguar XKSS poster design:)


In [17]:
ai_pr_df.reset_index(drop=True, inplace = True)
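
The `head` and `tail` outputs above show that each frame carries its own original row labels into the concatenated result, which is why the index is reset. A small sketch, using hypothetical frames `left` and `right`, of what happens to the labels and two equivalent ways to get a clean 0..n-1 index:

```python
import pandas as pd

# Two filtered frames that kept their original (non-contiguous) row labels.
left = pd.DataFrame({'title': ['x', 'y']}, index=[1, 5])
right = pd.DataFrame({'title': ['z', 'w']}, index=[1, 3])

# Plain concatenation keeps both sets of labels, so label 1 appears twice.
stacked = pd.concat([left, right], axis=0)
print(stacked.index.tolist())   # [1, 5, 1, 3]
print(stacked.index.is_unique)  # False

# Option 1: reset afterwards, discarding the old labels.
clean = stacked.reset_index(drop=True)

# Option 2: ask concat to renumber the rows directly.
alt = pd.concat([left, right], ignore_index=True)

print(clean.index.tolist())     # [0, 1, 2, 3]
```

With duplicated or missing labels, a label-based lookup such as `df['title'][i]` can return several rows or raise a `KeyError`, so renumbering before any positional-style loop is the safer order.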

### Create a DataFrame with Post Times

In [18]:
frames = [pr_df, ai_df]
ai_pr_time = pd.concat(frames, axis = 0) # Stack the DataFrames vertically (rows of one below the other)

In [19]:
ai_pr_time.head(1)

Unnamed: 0,created_utc,full_link,num_comments,score,selftext,subreddit,subreddit_subscribers,title
1,1546313619,https://www.reddit.com/r/ProCreate/comments/abejip/giving_this_a_go/,7,1,,ProCreate,4227,Giving this a go!


In [20]:
ai_pr_time.drop(['full_link','num_comments','score','selftext','subreddit'], axis=1, inplace=True)

In [21]:
ai_pr_time.head(1)

Unnamed: 0,created_utc,subreddit_subscribers,title
1,1546313619,4227,Giving this a go!
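
The `created_utc` values are Unix epoch seconds. The notebook keeps them as integers; if a readable timestamp is wanted later, one option is `pd.to_datetime` with `unit='s'`. This is a sketch only: the `sample` frame and the `created_dt` column name are assumptions made for illustration, not part of the saved CSV.

```python
import pandas as pd

# Two epoch values copied from the outputs above.
sample = pd.DataFrame({'created_utc': [1546306570, 1546313619]})

# Interpret the integers as seconds since 1970-01-01 and mark them as UTC.
sample['created_dt'] = pd.to_datetime(sample['created_utc'], unit='s', utc=True)

print(sample['created_dt'].iloc[1])  # 2019-01-01 03:33:39+00:00
```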


In [22]:
# Write the time DataFrame to 'ibm_watson_time.csv' (without the row index)
ai_pr_time.to_csv('data/1_Scraped_Data_IBM/ibm_watson_time.csv', index=False)
print('Submission CSV is ready!')

Submission CSV is ready!


### Create a new column called `status_char_length` that contains the character length of each title

> Note: You can do this in one line with `map`.
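
As a hedged illustration of the one-line approach, the sketch below compares `map(len)` with the vectorised `.str.len()` on a hypothetical `titles` Series. The difference only matters when a title is missing: `len` raises on NaN, while `.str.len()` passes the NaN through.

```python
import numpy as np
import pandas as pd

# Two titles from the data plus a missing one, added here for illustration.
titles = pd.Series(['Giving this a go!', 'Day 1 • 365 challenge', np.nan])

# .str.len() returns NaN for missing entries instead of raising.
lengths = titles.str.len()

# map(len) needs the gaps filled first (here with an empty string).
lengths_map = titles.fillna('').map(len)

print(lengths_map.tolist())  # [17, 21, 0]
```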

In [23]:
## Title lengths
ai_pr_df['status_char_length'] = ai_pr_df['title'].map(len)  # len() applied to every title

### Create a new column called `status_word_count` that contains the number of words in each title

> Note: You can count the strings separated by whitespace; you're not required to check that each whitespace-delimited string is a word in the dictionary.
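
A short sketch of what whitespace splitting counts, using a hypothetical `titles` Series. `str.split()` with no argument collapses runs of whitespace and ignores leading or trailing spaces, and any standalone symbol (such as the bullet in the second title) counts as a word.

```python
import pandas as pd

# Two titles from the data plus one with irregular spacing, added for illustration.
titles = pd.Series(['Giving this a go!', 'Day 1 • 365 challenge', '  extra   spaces  '])

# Split each title on whitespace and count the resulting pieces.
word_counts = titles.map(lambda t: len(t.split()))

print(word_counts.tolist())  # [4, 5, 2]
```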

In [24]:
ai_pr_df['status_word_count'] = ai_pr_df['title'].map(lambda t: len(t.split()))  # Split on whitespace, then count the pieces with len()

In [25]:
ai_pr_df.head()

Unnamed: 0,subreddit,title,status_char_length,status_word_count
0,ProCreate,Giving this a go!,17,4
1,ProCreate,"Recently got an iPad and have never done digital art before. Not perfect, but I think it’s ok for a beginner.",109,21
2,ProCreate,Occasionally can't draw in specific spots?,42,6
3,ProCreate,Day 1 • 365 challenge,21,5
4,ProCreate,First finished painting in procreate! Trying for 31 flowers in January.,71,11


In [26]:
# Write the combined DataFrame to 'ibm_watson.csv' (without the row index)
ai_pr_df.to_csv('data/1_Scraped_Data_IBM/ibm_watson.csv', index=False)
print('Submission CSV is ready!')

Submission CSV is ready!
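
Both export cells pass `index=False`. A minimal sketch of why that matters, using an in-memory buffer and a hypothetical frame `df` rather than the real files: when the index is written, reading the file back produces an extra `Unnamed: 0` column.

```python
import io
import pandas as pd

df = pd.DataFrame({'title': ['a', 'b']}, index=[3, 7])

# Default behaviour: the row index is written as an unnamed first column.
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
cols_with = pd.read_csv(with_index).columns.tolist()

# index=False: only the real columns are written.
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
cols_without = pd.read_csv(without_index).columns.tolist()

print(cols_with)     # ['Unnamed: 0', 'title']
print(cols_without)  # ['title']
```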

