# Project 3 - Deep Cleaning Data for Final Use in EDA and Modeling

In this portion of the project, we clean our data more deeply to prepare it fully for EDA and modeling. Here we determine the key features needed to draw inferences from our data and then to predict our subreddits. This is the single most important part of this project.

## Contents

- [1.0 Setup - Import Libraries](#one)
- [2.0 Import Data and Setup Pandas Dataframe](#two)
- [3.0 Unspool Response Column](#three)
- [4.0 Add Date and Subscribers To DataFrame](#four)

## 1.0 Setup - Import Libraries<a name="one"></a>

In [1]:
import warnings
import pandas as pd
import numpy as np
from datetime import datetime
import json
import ast

In [2]:
# this setting widens how many characters pandas will display in a column:
pd.options.display.max_colwidth = 350

## 2.0 Import Data and Setup Pandas Dataframe <a name="two"></a>

In [3]:
df = pd.read_csv('data/2_Cleaned_IBM_Data/cleaned_ibm.csv')
df.head(5) # Import and Inspect our DataFrame

Unnamed: 0,subreddit,title,status_char_length,status_word_count,response,normalized
0,ProCreate,"Recently got an iPad and have never done digital art before. Not perfect, but I think it’s ok for a beginner.",109,21,"{'usage': {'text_units': 1, 'text_characters': 109, 'features': 1}, 'language': 'en', 'keywords': [{'text': 'digital art', 'sentiment': {'score': -0.701735, 'label': 'negative'}, 'relevance': 0.80561, 'emotion': {'sadness': 0.314979, 'joy': 0.433697, 'fear': 0.054239, 'disgust': 0.043605, 'anger': 0.12908}, 'count': 1}]}",text relevance count sentiment.score sentiment.label \\n0 digital art 0.80561 1 -0.701735 negative \n\n emotion.sadness emotion.joy emotion.fear emotion.disgust emotion.anger \n0 0.314979 0.433697 0.054239 0.043605 0.12908
1,ProCreate,Occasionally can't draw in specific spots?,42,6,"{'usage': {'text_units': 1, 'text_characters': 42, 'features': 1}, 'language': 'en', 'keywords': [{'text': 'specific spots', 'sentiment': {'score': 0, 'label': 'neutral'}, 'relevance': 0.5, 'emotion': {'sadness': 0.024311, 'joy': 0.102704, 'fear': 0.033228, 'disgust': 0.041752, 'anger': 0.052201}, 'count': 1}]}",text relevance count sentiment.score sentiment.label \\n0 specific spots 0.5 1 0 neutral \n\n emotion.sadness emotion.joy emotion.fear emotion.disgust emotion.anger \n0 0.024311 0.102704 0.033228 0.041752 0.052201
2,ProCreate,First finished painting in procreate! Trying for 31 flowers in January.,71,11,"{'usage': {'text_units': 1, 'text_characters': 71, 'features': 1}, 'language': 'en', 'keywords': [{'text': 'flowers', 'sentiment': {'score': 0, 'label': 'neutral'}, 'relevance': 0.699444, 'emotion': {'sadness': 0.263336, 'joy': 0.557145, 'fear': 0.04873, 'disgust': 0.035199, 'anger': 0.075322}, 'count': 1}]}",text relevance count sentiment.score sentiment.label \\n0 flowers 0.699444 1 0 neutral \n\n emotion.sadness emotion.joy emotion.fear emotion.disgust emotion.anger \n0 0.263336 0.557145 0.04873 0.035199 0.075322
3,ProCreate,"I just bought an ipad, and downloaded Procreate, what a intuitive tool! Do you have some great tips maybe?",106,19,"{'usage': {'text_units': 1, 'text_characters': 106, 'features': 1}, 'language': 'en', 'keywords': [{'text': 'great tips', 'sentiment': {'score': 0, 'label': 'neutral'}, 'relevance': 0.763043, 'emotion': {'sadness': 0.032459, 'joy': 0.809301, 'fear': 0.154072, 'disgust': 0.008604, 'anger': 0.004675}, 'count': 1}]}",text relevance count sentiment.score sentiment.label \\n0 great tips 0.763043 1 0 neutral \n\n emotion.sadness emotion.joy emotion.fear emotion.disgust emotion.anger \n0 0.032459 0.809301 0.154072 0.008604 0.004675
4,ProCreate,First piece with Procreate– constructive feedback welcome!,58,7,"{'usage': {'text_units': 1, 'text_characters': 58, 'features': 1}, 'language': 'en', 'keywords': [{'text': 'First piece', 'sentiment': {'score': 0.932697, 'label': 'positive'}, 'relevance': 0.998433, 'emotion': {'sadness': 0.017143, 'joy': 0.430622, 'fear': 0.001068, 'disgust': 0.01123, 'anger': 0.051191}, 'count': 1}]}",text relevance count sentiment.score sentiment.label \\n0 First piece 0.998433 1 0.932697 positive \n\n emotion.sadness emotion.joy emotion.fear emotion.disgust emotion.anger \n0 0.017143 0.430622 0.001068 0.01123 0.051191


## 3.0 Unspool Response Column<a name="three"></a>

We will unspool the dictionary/list in the response column to extract the sentiment and emotional information we want to gather for our analysis.
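
As a minimal sketch of what unspooling means here, the snippet below parses one `response` string (copied from a row shown in this notebook) and flattens its keyword list. The variable names `raw_response`, `parsed`, `flat` and `first_keyword` are ours for illustration, and `pd.json_normalize` is one possible way to flatten the structure, not necessarily how the `normalized` column was originally produced.

```python
import ast

import pandas as pd

# One cell of the `response` column, as it arrives from the CSV (a string, not a dict).
raw_response = (
    "{'usage': {'text_units': 1, 'text_characters': 42, 'features': 1}, "
    "'language': 'en', "
    "'keywords': [{'text': 'error', "
    "'sentiment': {'score': -0.697416, 'label': 'negative'}, "
    "'relevance': 0.82613, "
    "'emotion': {'sadness': 0.870451, 'joy': 0.006608, 'fear': 0.090859, "
    "'disgust': 0.058557, 'anger': 0.147008}, 'count': 1}]}"
)

# The string uses Python-style single quotes, so ast.literal_eval is used rather than json.loads.
parsed = ast.literal_eval(raw_response)

# Flatten the list of keyword dicts; nested keys become dotted column names.
flat = pd.json_normalize(parsed["keywords"])
first_keyword = flat.iloc[0]

print(first_keyword["text"], first_keyword["emotion.sadness"], first_keyword["sentiment.score"])
```

The dotted names (`emotion.sadness`, `sentiment.score`) correspond to the fields we pull out one by one in the cells that follow.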

In [4]:
test_df = df.copy() # Create a new DataFrame

In [5]:
test_df.tail() # Inspect DataFrame to ensure Copy was done correctly

Unnamed: 0,subreddit,title,status_char_length,status_word_count,response,normalized
7033,AdobeIllustrator,No matter what I do I get this error. Help,42,10,"{'usage': {'text_units': 1, 'text_characters': 42, 'features': 1}, 'language': 'en', 'keywords': [{'text': 'error', 'sentiment': {'score': -0.697416, 'label': 'negative'}, 'relevance': 0.82613, 'emotion': {'sadness': 0.870451, 'joy': 0.006608, 'fear': 0.090859, 'disgust': 0.058557, 'anger': 0.147008}, 'count': 1}]}",text relevance count sentiment.score sentiment.label emotion.sadness \\n0 error 0.82613 1 -0.697416 negative 0.870451 \n\n emotion.joy emotion.fear emotion.disgust emotion.anger \n0 0.006608 0.090859 0.058557 0.147008
7034,AdobeIllustrator,Practicing color highlights lately and was very happy with this. Glass Coca Cola,80,13,"{'usage': {'text_units': 1, 'text_characters': 80, 'features': 1}, 'language': 'en', 'keywords': [{'text': 'Glass Coca Cola', 'sentiment': {'score': 0, 'label': 'neutral'}, 'relevance': 0.751211, 'emotion': {'sadness': 0.19857, 'joy': 0.253808, 'fear': 0.118319, 'disgust': 0.113976, 'anger': 0.14925}, 'count': 1}]}",text relevance count sentiment.score sentiment.label \\n0 Glass Coca Cola 0.751211 1 0 neutral \n\n emotion.sadness emotion.joy emotion.fear emotion.disgust emotion.anger \n0 0.19857 0.253808 0.118319 0.113976 0.14925
7035,AdobeIllustrator,Illustrator become laggy after editing/resizing artboard. Moving objects and zooming becomes extremely slow. Any idea why?,122,16,"{'usage': {'text_units': 1, 'text_characters': 122, 'features': 1}, 'language': 'en', 'keywords': [{'text': 'Illustrator', 'sentiment': {'score': 0, 'label': 'neutral'}, 'relevance': 0.692774, 'emotion': {'sadness': 0.115133, 'joy': 0.404409, 'fear': 0.464286, 'disgust': 0.036661, 'anger': 0.093767}, 'count': 1}]}",text relevance count sentiment.score sentiment.label \\n0 Illustrator 0.692774 1 0 neutral \n\n emotion.sadness emotion.joy emotion.fear emotion.disgust emotion.anger \n0 0.115133 0.404409 0.464286 0.036661 0.093767
7036,AdobeIllustrator,Is there a way to batch-populate folders with similar file names? Possibly using Adobe Bridge?,94,15,"{'usage': {'text_units': 1, 'text_characters': 94, 'features': 1}, 'language': 'en', 'keywords': [{'text': 'similar file names', 'sentiment': {'score': 0, 'label': 'neutral'}, 'relevance': 0.754343, 'emotion': {'sadness': 0.241949, 'joy': 0.105252, 'fear': 0.037448, 'disgust': 0.07314, 'anger': 0.08362}, 'count': 1}]}",text relevance count sentiment.score sentiment.label \\n0 similar file names 0.754343 1 0 neutral \n\n emotion.sadness emotion.joy emotion.fear emotion.disgust emotion.anger \n0 0.241949 0.105252 0.037448 0.07314 0.08362
7037,AdobeIllustrator,Hey illustrators... I'm relatively new to illustrator but have taken on a small job for a friend of a friend. Maybe I'm just overtired but what the hell is this thing? I can't make it go away and have no idea how I even got it up in the first place. Any advice appreciated,272,54,"{'usage': {'text_units': 1, 'text_characters': 272, 'features': 1}, 'language': 'en', 'keywords': [{'text': 'small job', 'sentiment': {'score': 0.756212, 'label': 'positive'}, 'relevance': 0.723411, 'emotion': {'sadness': 0.147012, 'joy': 0.661667, 'fear': 0.085309, 'disgust': 0.029754, 'anger': 0.096078}, 'count': 1}]}",text relevance count sentiment.score sentiment.label \\n0 small job 0.723411 1 0.756212 positive \n\n emotion.sadness emotion.joy emotion.fear emotion.disgust emotion.anger \n0 0.147012 0.661667 0.085309 0.029754 0.096078


> **SENTIMENT:** The label is the overall sentiment of the document (positive, negative, or neutral). The sentiment label is based on the score; a score of 0.0 would indicate that the document is neutral, a positive number would indicate the document is positive, a negative number would indicate the document is negative.

> **RELEVANCE:** The relevance score ranges from 0.0 to 1.0. The higher the score, the more relevant the keyword.
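
The sign rule quoted above can be sketched as a small helper. The function name `sentiment_label` is hypothetical, and this only reproduces the rule as described in the quote; it is not the service's internal logic.

```python
def sentiment_label(score: float) -> str:
    """Map a sentiment score to a label using the sign rule described above."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


# Scores taken from the first rows of our DataFrame.
for score in (-0.701735, 0, 0.932697):
    print(score, sentiment_label(score))
```

Applied to the rows shown earlier, this agrees with the labels stored in the `response` column.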

In [6]:
warnings.filterwarnings('ignore') # Suppress Warnings
test_df["sadness"] = 0 # Assign a Value of '0' to each new column below
test_df["joy"] = 0
test_df["fear"] = 0
test_df["disgust"] = 0
test_df["anger"] = 0
test_df["keyword"] = 0
test_df["relevance_score"] = 0
test_df["sentiment_score"] = 0

In [7]:
test_df.head(1) # Inspect Data Frame

Unnamed: 0,subreddit,title,status_char_length,status_word_count,response,normalized,sadness,joy,fear,disgust,anger,keyword,relevance_score,sentiment_score
0,ProCreate,"Recently got an iPad and have never done digital art before. Not perfect, but I think it’s ok for a beginner.",109,21,"{'usage': {'text_units': 1, 'text_characters': 109, 'features': 1}, 'language': 'en', 'keywords': [{'text': 'digital art', 'sentiment': {'score': -0.701735, 'label': 'negative'}, 'relevance': 0.80561, 'emotion': {'sadness': 0.314979, 'joy': 0.433697, 'fear': 0.054239, 'disgust': 0.043605, 'anger': 0.12908}, 'count': 1}]}",text relevance count sentiment.score sentiment.label \\n0 digital art 0.80561 1 -0.701735 negative \n\n emotion.sadness emotion.joy emotion.fear emotion.disgust emotion.anger \n0 0.314979 0.433697 0.054239 0.043605 0.12908,0,0,0,0,0,0,0,0


In [8]:
test_df['response'] = test_df['response'].apply(lambda x: ast.literal_eval(x)) # convert to Dictionary
# https://stackoverflow.com/questions/39169718/convert-string-to-dict-then-access-keyvalues-how-to-access-data-in-a-class

In [9]:
for i in range(len(test_df)):
    try:
        kw = test_df.at[i, 'response']['keywords'][0] # First keyword returned for this title
        # .at writes into test_df directly; chained test_df[col][i] = ... may write to a copy
        test_df.at[i, "sadness"] = str(kw['emotion']['sadness'])
        test_df.at[i, "joy"] = str(kw['emotion']['joy'])
        test_df.at[i, "fear"] = str(kw['emotion']['fear'])
        test_df.at[i, "disgust"] = str(kw['emotion']['disgust'])
        test_df.at[i, "anger"] = str(kw['emotion']['anger'])
        test_df.at[i, "keyword"] = str(kw['text'])
        test_df.at[i, "relevance_score"] = str(kw['relevance'])
        test_df.at[i, "sentiment_score"] = str(kw['sentiment']['score'])
    except (KeyError, IndexError): # No keyword or a missing field: keep the default 0
        continue
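
As an alternative sketch to the row-by-row loop, the extraction can be written as a function that returns one dict per response, which is then turned into columns in a single step. The names `first_keyword_features` and `demo` are hypothetical, and note one behavioural difference: rows without a keyword end up as `NaN` here instead of the default `0` used above.

```python
import pandas as pd

NAN = float("nan")


def first_keyword_features(response: dict) -> dict:
    """Return the first keyword's emotions, text, relevance and sentiment score."""
    keywords = response.get("keywords") or [{}]  # tolerate a missing or empty keyword list
    first = keywords[0]
    emotion = first.get("emotion", {})
    sentiment = first.get("sentiment", {})
    return {
        "sadness": emotion.get("sadness", NAN),
        "joy": emotion.get("joy", NAN),
        "fear": emotion.get("fear", NAN),
        "disgust": emotion.get("disgust", NAN),
        "anger": emotion.get("anger", NAN),
        "keyword": first.get("text"),
        "relevance_score": first.get("relevance", NAN),
        "sentiment_score": sentiment.get("score", NAN),
    }


# Hypothetical two-row frame: one full response and one with no keywords.
demo = pd.DataFrame({"response": [
    {"keywords": [{"text": "digital art",
                   "sentiment": {"score": -0.701735, "label": "negative"},
                   "relevance": 0.80561,
                   "emotion": {"sadness": 0.314979, "joy": 0.433697, "fear": 0.054239,
                               "disgust": 0.043605, "anger": 0.12908},
                   "count": 1}]},
    {"keywords": []},
]})

# Build all feature columns at once, aligned on the original index.
features = pd.DataFrame([first_keyword_features(r) for r in demo["response"]], index=demo.index)
demo = demo.join(features)
print(demo.drop(columns="response"))
```

Because the values are never wrapped in `str`, the score columns in this sketch are numeric from the start.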

In [10]:
test_df.tail(2)

Unnamed: 0,subreddit,title,status_char_length,status_word_count,response,normalized,sadness,joy,fear,disgust,anger,keyword,relevance_score,sentiment_score
7036,AdobeIllustrator,Is there a way to batch-populate folders with similar file names? Possibly using Adobe Bridge?,94,15,"{'usage': {'text_units': 1, 'text_characters': 94, 'features': 1}, 'language': 'en', 'keywords': [{'text': 'similar file names', 'sentiment': {'score': 0, 'label': 'neutral'}, 'relevance': 0.754343, 'emotion': {'sadness': 0.241949, 'joy': 0.105252, 'fear': 0.037448, 'disgust': 0.07314, 'anger': 0.08362}, 'count': 1}]}",text relevance count sentiment.score sentiment.label \\n0 similar file names 0.754343 1 0 neutral \n\n emotion.sadness emotion.joy emotion.fear emotion.disgust emotion.anger \n0 0.241949 0.105252 0.037448 0.07314 0.08362,0.241949,0.105252,0.037448,0.07314,0.08362,similar file names,0.754343,0.0
7037,AdobeIllustrator,Hey illustrators... I'm relatively new to illustrator but have taken on a small job for a friend of a friend. Maybe I'm just overtired but what the hell is this thing? I can't make it go away and have no idea how I even got it up in the first place. Any advice appreciated,272,54,"{'usage': {'text_units': 1, 'text_characters': 272, 'features': 1}, 'language': 'en', 'keywords': [{'text': 'small job', 'sentiment': {'score': 0.756212, 'label': 'positive'}, 'relevance': 0.723411, 'emotion': {'sadness': 0.147012, 'joy': 0.661667, 'fear': 0.085309, 'disgust': 0.029754, 'anger': 0.096078}, 'count': 1}]}",text relevance count sentiment.score sentiment.label \\n0 small job 0.723411 1 0.756212 positive \n\n emotion.sadness emotion.joy emotion.fear emotion.disgust emotion.anger \n0 0.147012 0.661667 0.085309 0.029754 0.096078,0.147012,0.661667,0.085309,0.029754,0.096078,small job,0.723411,0.756212


In [11]:
test_df.dtypes # Check DataTypes: the extracted scores are still stored as strings (object)

subreddit             object
title                 object
status_char_length     int64
status_word_count      int64
response              object
normalized            object
sadness               object
joy                   object
fear                  object
disgust               object
anger                 object
keyword               object
relevance_score       object
sentiment_score       object
dtype: object

> Convert Strings to Floats

In [12]:
test_df["sadness"] = test_df["sadness"].astype(float)
test_df["joy"] = test_df["joy"].astype(float)
test_df["fear"] = test_df["fear"].astype(float)
test_df["disgust"] = test_df["disgust"].astype(float)
test_df["anger"] = test_df["anger"].astype(float)
test_df["relevance_score"] = test_df["relevance_score"].astype(float)
test_df["sentiment_score"] = test_df["sentiment_score"].astype(float)
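
The same conversion can be sketched over a list of columns. This is an illustration on a hypothetical two-row frame (`demo`), and it swaps in `pd.to_numeric(..., errors="coerce")`, which turns unparseable strings into `NaN` instead of raising as `astype(float)` would.

```python
import pandas as pd

score_cols = ["sadness", "joy", "fear", "disgust", "anger",
              "relevance_score", "sentiment_score"]

# Hypothetical frame with string scores, including one value that is not a number.
demo = pd.DataFrame({
    "sadness": ["0.314979", "0"],
    "joy": ["0.433697", "n/a"],
    "keyword": ["digital art", "specific spots"],
})

# Convert only the score columns that are present; text columns are left alone.
present = [col for col in score_cols if col in demo.columns]
demo[present] = demo[present].apply(pd.to_numeric, errors="coerce")

print(demo.dtypes)
```

Whether silent coercion or a hard failure is preferable depends on how much we trust the upstream extraction.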

In [13]:
test_df.dtypes

subreddit              object
title                  object
status_char_length      int64
status_word_count       int64
response               object
normalized             object
sadness               float64
joy                   float64
fear                  float64
disgust               float64
anger                 float64
keyword                object
relevance_score       float64
sentiment_score       float64
dtype: object

## 4.0 Add Date and Subscribers To DataFrame<a name="four"></a>

> We add the date and subscriber count to the current DataFrame to prep for the final analysis

In [14]:
df_time = pd.read_csv('/data/1_Scraped_Data_IBM/ibm_watson_time.csv') # Import DataFrame
df_time.head() # Inspect the first rows

Unnamed: 0,created_utc,subreddit_subscribers,title
0,1546313619,4227,Giving this a go!
1,1546320962,4228,"Recently got an iPad and have never done digital art before. Not perfect, but I think it’s ok for a beginner."
2,1546326030,4230,Occasionally can't draw in specific spots?
3,1546367646,4238,Day 1 • 365 challenge
4,1546397226,4251,First finished painting in procreate! Trying for 31 flowers in January.


In [15]:
df_time.shape # Check Current Shape of DataFrame

(10011, 3)

In [16]:
df_time['status_word_count'] = [len(df_time['title'][i].split()) for i in range(0,df_time['title'].shape[0])]
# Split each title on whitespace, then count the words with the built-in len function

In [17]:
df_time.shape # Check Current Shape to ensure the column was added

(10011, 4)

In [18]:
df_time = df_time[df_time['status_word_count'] > 5]
# Keep only rows with more than 5 words to align with the IBM data frame we want to Concat to

In [19]:
df_time.reset_index(inplace = True, drop = True)
# Reset the index in place and drop the old one so rows are numbered from 0 again
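
The word count, filter and index reset can also be sketched with vectorized string methods. The frame `titles` below is a hypothetical three-row sample built from titles shown in this notebook.

```python
import pandas as pd

titles = pd.DataFrame({"title": [
    "Giving this a go!",
    "Occasionally can't draw in specific spots?",
    "Day 1 • 365 challenge",
]})

# Split each title on whitespace and count the resulting tokens.
titles["status_word_count"] = titles["title"].str.split().str.len()

# Keep titles with more than 5 words, then renumber the rows from 0.
kept = titles[titles["status_word_count"] > 5].reset_index(drop=True)

print(titles["status_word_count"].tolist())
print(kept)
```

Only the six-word title survives the filter, which matches the rows that are missing from the IBM DataFrame.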

In [20]:
df_time.shape # Checking shape to ensure this was done correctly

(7038, 4)

In [21]:
test_df.shape # df_time.shape[0] equals test_df.shape[0], so the row counts match

(7038, 14)
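
Equal row counts do not by themselves guarantee that the rows line up, so a stricter check is to compare the `title` columns before copying values across by position. The sketch below uses hypothetical frames `left` and `right` standing in for `test_df` and `df_time`; it assumes titles are stored identically in both files, which may not hold if one was re-encoded or trimmed.

```python
import pandas as pd

left = pd.DataFrame({"title": ["first post title", "second post title"]})
right = pd.DataFrame({
    "title": ["first post title", "second post title"],
    "created_utc": [1546320962, 1546326030],
})

# Same length and the same titles in the same order means a positional copy is safe.
aligned = len(left) == len(right) and left["title"].equals(right["title"])
print(aligned)

# A reordered frame has the same shape but fails the check.
shuffled = right.iloc[::-1].reset_index(drop=True)
aligned_shuffled = len(left) == len(shuffled) and left["title"].equals(shuffled["title"])
print(aligned_shuffled)
```

If the check fails, merging on `title` is a possible fallback, though duplicate titles would need handling first.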

In [22]:
test_df['time'] = df_time['created_utc'] # Add epoch time to new DataFrame

In [23]:
test_df['subscribers'] = df_time['subreddit_subscribers'] # Add subreddit subscriber counts to the DataFrame

In [24]:
test_df[test_df['subreddit'] == 'ProCreate'].head(1) # Inspect the head

Unnamed: 0,subreddit,title,status_char_length,status_word_count,response,normalized,sadness,joy,fear,disgust,anger,keyword,relevance_score,sentiment_score,time,subscribers
0,ProCreate,"Recently got an iPad and have never done digital art before. Not perfect, but I think it’s ok for a beginner.",109,21,"{'usage': {'text_units': 1, 'text_characters': 109, 'features': 1}, 'language': 'en', 'keywords': [{'text': 'digital art', 'sentiment': {'score': -0.701735, 'label': 'negative'}, 'relevance': 0.80561, 'emotion': {'sadness': 0.314979, 'joy': 0.433697, 'fear': 0.054239, 'disgust': 0.043605, 'anger': 0.12908}, 'count': 1}]}",text relevance count sentiment.score sentiment.label \\n0 digital art 0.80561 1 -0.701735 negative \n\n emotion.sadness emotion.joy emotion.fear emotion.disgust emotion.anger \n0 0.314979 0.433697 0.054239 0.043605 0.12908,0.314979,0.433697,0.054239,0.043605,0.12908,digital art,0.80561,-0.701735,1546320962,4228


In [25]:
test_df['date'] = pd.to_datetime(test_df['time'],unit='s') # Convert Epoch Time to standard date format
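
As a quick check of the epoch conversion, the sketch below converts two `created_utc` values from our data. The names `epochs`, `dates` and `dates_utc` are ours; passing `utc=True` is an optional variation that makes the UTC time zone explicit rather than implied.

```python
import pandas as pd

epochs = pd.Series([1546320962, 1546313619])

# Seconds since 1970-01-01 UTC, converted to (time-zone-naive) timestamps.
dates = pd.to_datetime(epochs, unit="s")

# The same instants, labelled explicitly as UTC.
dates_utc = pd.to_datetime(epochs, unit="s", utc=True)

print(dates[0])      # 2019-01-01 05:36:02
print(dates_utc[0])  # 2019-01-01 05:36:02+00:00
```

The first value matches the `date` shown for the first row of our DataFrame.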

In [26]:
test_df.head(1) # Inspect Head to ensure change above executed correctly

Unnamed: 0,subreddit,title,status_char_length,status_word_count,response,normalized,sadness,joy,fear,disgust,anger,keyword,relevance_score,sentiment_score,time,subscribers,date
0,ProCreate,"Recently got an iPad and have never done digital art before. Not perfect, but I think it’s ok for a beginner.",109,21,"{'usage': {'text_units': 1, 'text_characters': 109, 'features': 1}, 'language': 'en', 'keywords': [{'text': 'digital art', 'sentiment': {'score': -0.701735, 'label': 'negative'}, 'relevance': 0.80561, 'emotion': {'sadness': 0.314979, 'joy': 0.433697, 'fear': 0.054239, 'disgust': 0.043605, 'anger': 0.12908}, 'count': 1}]}",text relevance count sentiment.score sentiment.label \\n0 digital art 0.80561 1 -0.701735 negative \n\n emotion.sadness emotion.joy emotion.fear emotion.disgust emotion.anger \n0 0.314979 0.433697 0.054239 0.043605 0.12908,0.314979,0.433697,0.054239,0.043605,0.12908,digital art,0.80561,-0.701735,1546320962,4228,2019-01-01 05:36:02


In [27]:
test_df.drop(['response','normalized','keyword','time','relevance_score'], axis=1, inplace=True)
# Drop Columns we don't need

In [28]:
test_df.head(1) # Inspect Head to ensure change above executed correctly

Unnamed: 0,subreddit,title,status_char_length,status_word_count,sadness,joy,fear,disgust,anger,sentiment_score,subscribers,date
0,ProCreate,"Recently got an iPad and have never done digital art before. Not perfect, but I think it’s ok for a beginner.",109,21,0.314979,0.433697,0.054239,0.043605,0.12908,-0.701735,4228,2019-01-01 05:36:02


In [29]:
# Write the DataFrame you created to a csv called 'final.csv'
test_df.to_csv('data/3_Final_Cleaning/final.csv', index=False)
print('Submission CSV is ready!')

Submission CSV is ready!
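
One thing to keep in mind when loading `final.csv` in the next notebook: CSV files store dates as plain text, so the `date` column has to be parsed again on read. A minimal sketch, using an in-memory buffer and a hypothetical one-row frame instead of the real file:

```python
import io

import pandas as pd

frame = pd.DataFrame({
    "subreddit": ["ProCreate"],
    "date": pd.to_datetime([1546320962], unit="s"),
})

# Write to an in-memory buffer instead of disk, without the index column.
buffer = io.StringIO()
frame.to_csv(buffer, index=False)
buffer.seek(0)

# parse_dates restores the datetime dtype that the CSV text format does not keep.
restored = pd.read_csv(buffer, parse_dates=["date"])
print(restored.dtypes)
```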
