# Combining Green & Sustainability comments, 2015-2020


In [1]:
import pandas as pd
a = pd.read_csv("Green-comments-before2020.csv")
b = pd.read_csv("sustainability-comments-before2020.csv")

In [2]:
merged = pd.concat([a,b],ignore_index=True)
merged.head()

Unnamed: 0,id,submission_id,body,created_utc,parent_id,permalink
0,ffbxc48,ess8fe,"An alternative economic model is needed, but N...",1579785808,t3_ess8fe,/r/Green/comments/ess8fe/prince_charles_we_nee...
1,ffc5i6h,ess8fe,\*splutters tea everywhere\* The dude _owns_ t...,1579792001,t3_ess8fe,/r/Green/comments/ess8fe/prince_charles_we_nee...
2,ff8z7yf,esc09d,This is garbage.\n\nHalf of these aren't even ...,1579700709,t3_esc09d,/r/Green/comments/esc09d/did_you_know_all_of_t...
3,ff8znss,esc09d,"Hello, thank you for your comment. \nI am myse...",1579701058,t1_ff8z7yf,/r/Green/comments/esc09d/did_you_know_all_of_t...
4,ff8i5we,es722y,It sounds great using wood and wooden building...,1579679578,t3_es722y,/r/Green/comments/es722y/the_hottest_new_thing...


In [3]:
print('Training Set Shape = {}'.format(merged.shape))
print('Training Set Memory Usage = {:.2f} MB'.format(merged.memory_usage().sum() / 1024**2))

Training Set Shape = (22466, 6)
Training Set Memory Usage = 1.03 MB
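
The memory figure above matches what 22466 rows × 6 columns of 8-byte values would occupy, which suggests it counts only the references held in the object columns and not the comment text itself. A minimal sketch, using a small invented frame rather than the real data, of how `memory_usage(deep=True)` also counts the string contents:

```python
import pandas as pd

# Toy stand-in for the merged comments frame (values are invented).
toy = pd.DataFrame({
    "id": ["a1", "b2", "c3"],
    "body": ["short", "a somewhat longer comment body", "x" * 500],
})

# Default: for object-dtype columns only the references are counted.
shallow_bytes = toy.memory_usage().sum()

# deep=True additionally inspects the stored strings.
deep_bytes = toy.memory_usage(deep=True).sum()

print(f"shallow = {shallow_bytes} bytes, deep = {deep_bytes} bytes")
```

On the real data the deep figure would probably be noticeably larger, since comment bodies can be long.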


### Pandas Profiling Package for text statistics

In [4]:
from pandas_profiling import ProfileReport

In [5]:
profile = ProfileReport(merged)
profile




In [6]:
# Work on a copy of the merged data and keep only the id and the text column (body)

df1 = merged.copy()
df1 = pd.DataFrame(df1, columns = ["id", "body"])

### Removing duplicates

In [7]:
duplicate = df1[df1["body"].duplicated()]
print(len(duplicate))
duplicate.head()

1508


Unnamed: 0,id,body
74,fdh5xp2,"Sorry, your submission has been automatically ..."
94,fcxmq4p,"Sorry, your submission has been automatically ..."
113,fcaxl2s,"Sorry, your submission has been automatically ..."
116,fc9jbg1,"Sorry, your submission has been automatically ..."
124,fc0lstm,[deleted]


**As the output above shows, 1508 rows repeat an earlier body. However, these are not ordinary comments with similar text: they are "[deleted]" and "[removed]" placeholders and bot messages.**
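
To see which texts account for the repeats, the duplicated bodies can be counted. This is a minimal sketch on an invented frame; `toy` and its values are assumptions, not the real comments:

```python
import pandas as pd

toy = pd.DataFrame({
    "id": ["c1", "c2", "c3", "c4", "c5", "c6"],
    "body": ["[deleted]", "Nice post", "[deleted]", "[removed]", "[removed]", "[deleted]"],
})

# keep=False marks every copy of a repeated body, not only the later ones.
repeated = toy[toy["body"].duplicated(keep=False)]

# Count how often each repeated body occurs, most frequent first.
counts = repeated["body"].value_counts()
print(counts)
```

Note that `duplicated()` with its default `keep="first"` (as used in the cell above) leaves the first occurrence unmarked, so it reports fewer rows than `keep=False`.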

### Removing unnecessary rows "[deleted]"

In [8]:
df1 = df1.loc[df1["body"] != "[deleted]"]
df1.shape

(21899, 2)

### Removing unnecessary rows "[removed]"

In [9]:
df1 = df1.loc[df1["body"] != "[removed]"]
df1.shape

(21681, 2)

In [10]:
profile1 = ProfileReport(df1)
profile1




**As the pandas profile shows, some of the most frequent texts look like normal comments, but they are actually automated messages posted by bots.**
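
The bot messages shown in this notebook share the sentence "I am a bot, and this action was performed automatically". One possible alternative to matching each full message exactly is to filter on that shared signature. The sketch below assumes that every automated message carries this sentence, which may not hold for all bots; the frame is invented:

```python
import pandas as pd

# Signature sentence taken from the bot messages shown in this notebook.
BOT_SIGNATURE = "I am a bot, and this action was performed automatically"

toy = pd.DataFrame({
    "id": ["c1", "c2", "c3"],
    "body": [
        "Sorry, your submission has been automatically removed. "
        "*I am a bot, and this action was performed automatically.*",
        "Solar panels paid for themselves in six years.",
        "/r/Green does not allow image submissions. "
        "*I am a bot, and this action was performed automatically.*",
    ],
})

# regex=False treats the signature as a literal string;
# na=False makes missing bodies count as "not a bot message".
is_bot = toy["body"].str.contains(BOT_SIGNATURE, regex=False, na=False)
humans = toy.loc[~is_bot]
print(humans)
```

This removes more rows than the exact comparisons below, because it matches any message containing the signature, whatever the surrounding text.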


### Removing bots

In [11]:
# Most frequent body value; the pandas profile lists it first, and it is a bot message
df1["body"].value_counts().idxmax()

'Sorry, your submission has been automatically removed. Account age too young, spam likely.\\n\\n*I am a bot, and this action was performed automatically. Please [contact the moderators of this subreddit](/message/compose/?to=/r/Green) if you have any questions or concerns.*'

In [12]:
df1 = df1.loc[df1["body"] != "Sorry, your submission has been automatically removed. Account age too young, spam likely.\\n\\n*I am a bot, and this action was performed automatically. Please [contact the moderators of this subreddit](/message/compose/?to=/r/Green) if you have any questions or concerns.*"]
df1.shape

(21336, 2)

In [13]:
df1 = df1.loc[df1["body"] != "Sorry, your submission has been automatically removed. Not enough comment karma, spam likely.\\n\\n*I am a bot, and this action was performed automatically. Please [contact the moderators of this subreddit](/message/compose/?to=/r/Green) if you have any questions or concerns.*"]
df1.shape

(21197, 2)

In [14]:
df1 = df1.loc[df1["body"] != "Thank you!"]
df1.shape

(21178, 2)

In [15]:
df1 = df1.loc[df1["body"] != '!emissionsbot']
df1.shape

(21173, 2)

In [16]:
df1 = df1.loc[df1["body"] != 'Thank you']
df1.shape

(21168, 2)

In [17]:
df1 = df1.loc[df1["body"] != 'Thank you.']
df1.shape

(21163, 2)

In [18]:
df1 = df1.loc[df1["body"] != '/r/Green does not allow image submissions.\\n\\n*I am a bot, and this action was performed automatically. Please [contact the moderators of this subreddit](/message/compose/?to=/r/Green) if you have any questions or concerns.*']
df1.shape

(21092, 2)

In [19]:
profile2 = ProfileReport(df1)
profile2




In [20]:
df1 = df1.loc[df1["body"] != 'Thanks!']
df1.shape

(21080, 2)

In [21]:
df1 = df1.loc[df1["body"] != 'No']
df1.shape

(21073, 2)

In [22]:
df1 = df1.loc[df1["body"] != '/r/careerguidance']
df1.shape

(21065, 2)

In [23]:
df1 = df1.loc[df1["body"] != 'Lol']
df1.shape

(21061, 2)

In [24]:
df1 = df1.loc[df1["body"] != 'No.']
df1.shape

(21057, 2)

In [25]:
profile3 = ProfileReport(df1, title="Pandas Profiling Report", explorative=True)
profile3




In [28]:
df1 = df1.loc[df1["body"] != 'Source?']
df1.shape

(21053, 2)

In [29]:
df1 = df1.loc[df1["body"] != 'x-post: /r/EnviroNews']
df1.shape

(21049, 2)

In [30]:
df1 = df1.loc[df1["body"] != 'Thank you!!!']
df1.shape

(21045, 2)

In [31]:
df1 = df1.loc[df1["body"] != '?']
df1.shape

(21042, 2)

In [32]:
df1 = df1.loc[df1["body"] != 'Thanks']
df1.shape

(21039, 2)

In [34]:
df1 = df1.loc[df1["body"] != 'Yes']
df1.shape

(21036, 2)

In [35]:
df1 = df1.loc[df1["body"] != 'Yes.']
df1.shape

(21033, 2)

In [37]:
df1 = df1.loc[df1["body"] != 'Thank you!!']
df1.shape

(21030, 2)

In [38]:
df1 = df1.loc[df1["body"] != ':(']
df1.shape

(21027, 2)

In [39]:
profile4 = ProfileReport(df1, title="Pandas Profiling Report", explorative=True)
profile4




**So far, I have removed the rows that add no value to the analysis: "[deleted]" and "[removed]" placeholders, frequent bot messages, and very short replies such as "Thank you!". Reddit data marks such rows with placeholder text rather than NaN, so dropping NaN values alone is not sufficient. These types of rows also had to be removed:**

1. [removed]

2. [deleted]

3. "Sorry, your submission has been automatically removed..." (bot messages)
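
The one-value-per-cell filtering above could also be written as a single membership test. This is a sketch under assumptions: `LOW_VALUE_BODIES` and `drop_low_value` are hypothetical names, and the set holds only some of the values removed above:

```python
import pandas as pd

# Exact-match values taken from the cells above (not exhaustive).
LOW_VALUE_BODIES = {
    "[deleted]", "[removed]", "Thank you!", "Thank you", "Thank you.",
    "Thanks!", "Thanks", "No", "No.", "Yes", "Yes.", "Lol", "Source?", "?", ":(",
}

def drop_low_value(frame: pd.DataFrame, column: str = "body") -> pd.DataFrame:
    """Return a copy of frame without rows whose column value is in LOW_VALUE_BODIES."""
    keep = ~frame[column].isin(LOW_VALUE_BODIES)
    return frame.loc[keep].copy()

toy = pd.DataFrame({
    "id": ["c1", "c2", "c3", "c4"],
    "body": ["[removed]", "Thanks!", "Bamboo grows fast.", "thanks a lot for this"],
})
cleaned = drop_low_value(toy)
print(cleaned)
```

Like the `!=` comparisons, `isin` matches whole cell values only, so a longer comment that merely contains "Thanks" is kept.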

In [93]:
df2 = df1.copy()

### Expand Slang Abbreviations (Custom Function)

In [94]:
slang_dict = {
"AFAIK": "As Far As I Know",
"AFK": "Away From Keyboard",
"ASAP": "As Soon As Possible",
"ATK": "At The Keyboard",
"ATM": "At The Moment",
"A3": "Anytime, Anywhere, Anyplace",
"BAK": "Back At Keyboard",
"BBL": "Be Back Later",
"BBS": "Be Back Soon",
"BFN": "Bye For Now",
"B4N": "Bye For Now",
"BRB": "Be Right Back",
"BRT": "Be Right There",
"BTW": "By The Way",
"B4": "Before",
"CU": "See You",
"CUL8R": "See You Later",
"CYA": "See You",
"FAQ": "Frequently Asked Questions",
"FC": "Fingers Crossed",
"FWIW": "For What It's Worth",
"FYI": "For Your Information",
"GAL": "Get A Life",
"GG": "Good Game",
"GN": "Good Night",
"GMTA": "Great Minds Think Alike",
"GR8": "Great!",
"G9": "Genius",
"OMG": "Oh My God",
"IC": "I See",
"ICQ": "I Seek you (also a chat program)",
"ILU": "I Love You",
"IMHO": "In My Honest / Humble Opinion",
"IMO": "In My Opinion",
"IOW": "In Other Words",
"IRL": "In Real Life",
"KISS": "Keep It Simple, Stupid",
"LDR": "Long Distance Relationship",
"LMAO": "Laugh My Ass Off",
"LOL": "Laughing Out Loud",
"LTNS": "Long Time No See",
"L8R": "Later",
"MTE": "My Thoughts Exactly",
"M8": "Mate",
"NRN": "No Reply Necessary",
"OIC": "Oh I See",
"PITA": "Pain In The Ass",
"PRT": "Party",
"PRW": "Parents Are Watching",
"ROFL": "Rolling On The Floor Laughing",
"ROFLOL": "Rolling On The Floor Laughing Out Loud",
"ROTFLMAO": "Rolling On The Floor Laughing My Ass Off",
"SK8": "Skate",
"STATS": "Your sex and age",
"ASL": "Age, Sex, Location",
"THX": "Thank You",
"TTFN": "Ta-Ta For Now!",
"TTYL": "Talk To You Later",
"U": "You",
"U2": "You Too",
"U4E": "Yours For Ever",
"WB": "Welcome Back",
"WTF": "What The Fuck",
"WTG": "Way To Go!",
"WUF": "Where Are You From?",
"W8": "Wait"
}


In [95]:
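def slang(x):
    """Replace slang abbreviations in x with their expansions from slang_dict."""
    if type(x) is str:
        for key in slang_dict:
            value = slang_dict[key]
            # str.replace matches substrings, so a short key such as "U" is also
            # replaced inside longer words (e.g. "AU" becomes "AYou").
            x = x.replace(key,value)
        return x
    else:
        # Non-string values (e.g. NaN) are returned unchanged.
        return x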

In [96]:
# EXAMPLE
slang("Hey Rishabh BRB !!")

'Hey Rishabh Be Right Back !!'
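
Because `str.replace` works on substrings, short keys such as "U" also match inside ordinary words (the submissions output further down shows "AU" turning into "AYou"). A possible variant that only replaces whole tokens is sketched below. It is not the function used in this notebook; `slang_subset` and `expand_slang` are hypothetical names, and the dictionary is a small subset for illustration:

```python
import re

# Small hypothetical subset of the slang dictionary, for illustration only.
slang_subset = {"BRB": "Be Right Back", "U": "You", "IMO": "In My Opinion"}

# \b restricts matches to whole tokens; longer keys are listed first so that
# a longer abbreviation is preferred when two keys start at the same position.
pattern = re.compile(
    r"\b("
    + "|".join(re.escape(k) for k in sorted(slang_subset, key=len, reverse=True))
    + r")\b"
)

def expand_slang(text):
    """Expand whole-word slang abbreviations; non-strings pass through unchanged."""
    if not isinstance(text, str):
        return text
    return pattern.sub(lambda m: slang_subset[m.group(1)], text)

print(expand_slang("AU Brushfire Relief, BRB"))
```

With word boundaries, "AU" is left alone while the standalone "BRB" is still expanded.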

In [97]:
df2["SLANG-ext"] = df2["body"].apply(slang)

In [98]:
df2.head()

Unnamed: 0,id,body,SLANG-ext
0,ffbxc48,"An alternative economic model is needed, but N...","An alternative economic model is needed, but N..."
1,ffc5i6h,\*splutters tea everywhere\* The dude _owns_ t...,\*splutters tea everywhere\* The dude _owns_ t...
2,ff8z7yf,This is garbage.\n\nHalf of these aren't even ...,This is garbage.\n\nHalf of these aren't even ...
3,ff8znss,"Hello, thank you for your comment. \nI am myse...","Hello, thank you for your comment. \nI am myse..."
4,ff8i5we,It sounds great using wood and wooden building...,It sounds great using wood and wooden building...


### Expand Contractions

In [99]:
contractions_dict = {
"ain't": "am not / are not",
"aren't": "are not / am not",
"can't": "cannot",
"can't've": "cannot have",
"'cause": "because",
"could've": "could have",
"couldn't": "could not",
"couldn't've": "could not have",
"didn't": "did not",
"doesn't": "does not",
"don't": "do not",
"hadn't": "had not",
"hadn't've": "had not have",
"hasn't": "has not",
"haven't": "have not",
"he'd": "he had / he would",
"he'd've": "he would have",
"he'll": "he shall / he will",
"he'll've": "he shall have / he will have",
"he's": "he has / he is",
"how'd": "how did",
"how'd'y": "how do you",
"how'll": "how will",
"how's": "how has / how is",
"here's": "here is",
"i'd": "I had / I would",
"i'd've": "I would have",
"i'll": "I shall / I will",
"i'll've": "I shall have / I will have",
"i'm": "I am",
"i've": "I have",
"isn't": "is not",
"it'd": "it had / it would",
"it'd've": "it would have",
"it'll": "it shall / it will",
"it'll've": "it shall have / it will have",
"it's": "it has / it is",
"let's": "let us",
"ma'am": "madam",
"mayn't": "may not",
"might've": "might have",
"mightn't": "might not",
"mightn't've": "might not have",
"must've": "must have",
"mustn't": "must not",
"mustn't've": "must not have",
"needn't": "need not",
"needn't've": "need not have",
"o'clock": "of the clock",
"oughtn't": "ought not",
"oughtn't've": "ought not have",
"shan't": "shall not",
"sha'n't": "shall not",
"shan't've": "shall not have",
"she'd": "she had / she would",
"she'd've": "she would have",
"she'll": "she shall / she will",
"she'll've": "she shall have / she will have",
"she's": "she has / she is",
"should've": "should have",
"shouldn't": "should not",
"shouldn't've": "should not have",
"so've": "so have",
"so's": "so as / so is",
"that'd": "that would / that had",
"that'd've": "that would have",
"that's": "that has / that is",
"there'd": "there had / there would",
"there'd've": "there would have",
"there's": "there has / there is",
"they'd": "they had / they would",
"they'd've": "they would have",
"they'll": "they shall / they will",
"they'll've": "they shall have / they will have",
"they're": "they are",
"they've": "they have",
"to've": "to have",
"wasn't": "was not",
"we'd": "we had / we would",
"we'd've": "we would have",
"we'll": "we will",
"we'll've": "we will have",
"we're": "we are",
"we've": "we have",
"weren't": "were not",
"what'll": "what shall / what will",
"what'll've": "what shall have / what will have",
"what're": "what are",
"what's": "what has / what is",
"what've": "what have",
"when's": "when has / when is",
"when've": "when have",
"where'd": "where did",
"where's": "where has / where is",
"where've": "where have",
"who'll": "who shall / who will",
"who'll've": "who shall have / who will have",
"who's": "who has / who is",
"who've": "who have",
"why's": "why has / why is",
"why've": "why have",
"will've": "will have",
"won't": "will not",
"won't've": "will not have",
"would've": "would have",
"wouldn't": "would not",
"wouldn't've": "would not have",
"y'all": "you all",
"y'all'd": "you all would",
"y'all'd've": "you all would have",
"y'all're": "you all are",
"y'all've": "you all have",
"you'd": "you had / you would",
"you'd've": "you would have",
"you'll": "you shall / you will",
"you'll've": "you shall have / you will have",
"you're": "you are",
"you've": "you have"
}

In [100]:
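def cont_to_exp(x):
    """Lowercase x and expand contractions using contractions_dict."""
    if type(x) is str:
        for key in contractions_dict:
            value = contractions_dict[key]
            # Lowercasing inside the loop also lowercases earlier expansions such as "I am".
            # Keys are tried in dictionary order, so "can't" is replaced before "can't've" can match.
            x = x.lower().replace(key,value)
        return x
    else:
        # Non-string values (e.g. NaN) are returned unchanged.
        return x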

In [101]:
# EXAMPLE
cont_to_exp("What's the best way to ensure this?")

'what has / what is the best way to ensure this?'
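
The example output shows two side effects of the dictionary approach: ambiguous entries insert a literal " / " into the text, and shorter keys are replaced before longer ones that contain them. The sketch below shows one way to avoid both. It assumes a single expansion per key, which is a simplification (for example, "what's" is always read as "what is"); `contractions_subset` and `expand_contractions` are hypothetical names:

```python
import re

# Hypothetical subset with exactly one expansion per key.
contractions_subset = {
    "can't've": "cannot have",
    "can't": "cannot",
    "what's": "what is",
    "it's": "it is",
}

# Longest keys first so that "can't've" is tried before "can't".
_keys = sorted(contractions_subset, key=len, reverse=True)
_pattern = re.compile("|".join(re.escape(k) for k in _keys))

def expand_contractions(text):
    """Lowercase the text once, then expand known contractions in a single pass."""
    if not isinstance(text, str):
        return text
    return _pattern.sub(lambda m: contractions_subset[m.group(0)], text.lower())

print(expand_contractions("What's wrong? It can't've failed."))
```

A single regular-expression pass also avoids re-scanning the text once per dictionary key.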

In [102]:
df2["Cont-Exp"] = df2["SLANG-ext"].apply(cont_to_exp)

In [103]:
df2.head()

Unnamed: 0,id,body,SLANG-ext,Cont-Exp
0,ffbxc48,"An alternative economic model is needed, but N...","An alternative economic model is needed, but N...","an alternative economic model is needed, but n..."
1,ffc5i6h,\*splutters tea everywhere\* The dude _owns_ t...,\*splutters tea everywhere\* The dude _owns_ t...,\*splutters tea everywhere\* the dude _owns_ t...
2,ff8z7yf,This is garbage.\n\nHalf of these aren't even ...,This is garbage.\n\nHalf of these aren't even ...,this is garbage.\n\nhalf of these are not / am...
3,ff8znss,"Hello, thank you for your comment. \nI am myse...","Hello, thank you for your comment. \nI am myse...","hello, thank you for your comment. \ni am myse..."
4,ff8i5we,It sounds great using wood and wooden building...,It sounds great using wood and wooden building...,it sounds great using wood and wooden building...


In [74]:
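# Replace real newline characters with a space
df2 = df2.replace(r'\n',' ', regex=True)

In [75]:
# Replace the literal two-character sequence backslash + "n" (as it appears in the comment text) with a space
df2 = df2.replace(r'\\n',' ', regex=True)

In [None]:
# Replace the subreddit prefix "/r/" with a space
df2 = df2.replace(r'/r/',' ', regex=True)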

In [76]:
df2.head()

Unnamed: 0,id,body,SLANG-ext,Cont-Exp
0,ffbxc48,"An alternative economic model is needed, but N...","An alternative economic model is needed, but N...","an alternative economic model is needed, but n..."
1,ffc5i6h,\*splutters tea everywhere\* The dude _owns_ t...,\*splutters tea everywhere\* The dude _owns_ t...,\*splutters tea everywhere\* the dude _owns_ t...
2,ff8z7yf,This is garbage. Half of these aren't even ne...,This is garbage. Half of these aren't even ne...,this is garbage. half of these are not / am n...
3,ff8znss,"Hello, thank you for your comment. I am mysel...","Hello, thank you for your comment. I am mysel...","hello, thank you for your comment. i am mysel..."
4,ff8i5we,It sounds great using wood and wooden building...,It sounds great using wood and wooden building...,it sounds great using wood and wooden building...
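
The two replacements above target different things: the regex `\n` matches a real newline character, while `\\n` matches a backslash followed by the letter "n". The cell numbers suggest these cells were run in a different order from the rest of the notebook, and the final column further down contains "garbagennhalf", which would be consistent with the literal sequence still being present when special characters were stripped (this is an inference, not something the notebook states). A sketch that handles both forms in one step, on invented data:

```python
import pandas as pd

toy = pd.DataFrame({
    "body": [
        "first line\nsecond line",           # contains a real newline character
        "This is garbage.\\n\\nHalf of it",  # contains a literal backslash followed by "n"
    ]
})

# (?:\\n|\n)+ matches a run of either form and replaces the whole run with one space.
toy["body"] = toy["body"].str.replace(r"(?:\\n|\n)+", " ", regex=True)
print(toy["body"].tolist())
```

Running such a step before punctuation and special characters are removed keeps the words on either side of the break separated.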


### Text Cleaning

In [104]:
# NLTK is used only for stopword removal; the rest of the cleaning uses the NeatText library.
from nltk.corpus import stopwords
stop = stopwords.words('english')

In [111]:
df2['stopwords-removed'] = df2["Cont-Exp"].apply(lambda x: ' '.join([word for word in str(x).split() if word not in (stop)]))

In [112]:
df2.head()

Unnamed: 0,id,body,SLANG-ext,Cont-Exp,stopwords-removed
0,ffbxc48,"An alternative economic model is needed, but N...","An alternative economic model is needed, but N...","an alternative economic model is needed, but n...","alternative economic model needed, native amer..."
1,ffc5i6h,\*splutters tea everywhere\* The dude _owns_ t...,\*splutters tea everywhere\* The dude _owns_ t...,\*splutters tea everywhere\* the dude _owns_ t...,\*splutters tea everywhere\* dude _owns_ duchy...
2,ff8z7yf,This is garbage.\n\nHalf of these aren't even ...,This is garbage.\n\nHalf of these aren't even ...,this is garbage.\n\nhalf of these are not / am...,garbage.\n\nhalf / even necessarily correct. e...
3,ff8znss,"Hello, thank you for your comment. \nI am myse...","Hello, thank you for your comment. \nI am myse...","hello, thank you for your comment. \ni am myse...","hello, thank comment. \ni dedicated learn stri..."
4,ff8i5we,It sounds great using wood and wooden building...,It sounds great using wood and wooden building...,it sounds great using wood and wooden building...,sounds great using wood wooden buildings look ...
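
The stopword filter above checks each token against a Python list. A minimal sketch of the same idea with a set, which gives constant-time membership tests, follows. The tiny `stop_set` is a hypothetical stand-in so that the example runs without NLTK data; in the notebook it would be built as `set(stop)`:

```python
# Hypothetical mini stopword list; the notebook uses NLTK's English list instead.
stop_set = {"is", "a", "the", "of", "and", "to", "for"}

def remove_stopwords(text):
    """Drop tokens found in stop_set; tokens are split on whitespace only."""
    return " ".join(word for word in str(text).split() if word not in stop_set)

print(remove_stopwords("this is the cost of a solar panel"))
```

Because tokens are split on whitespace only, a stopword with punctuation attached (for example "the,") would not be removed at this stage.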


In [106]:
import neattext.functions as nfx

In [107]:
# List the names that neattext.functions provides for cleaning text data.
dir(nfx)

['BTC_ADDRESS_REGEX',
 'CURRENCY_REGEX',
 'CURRENCY_SYMB_REGEX',
 'Counter',
 'DATE_REGEX',
 'EMAIL_REGEX',
 'EMOJI_REGEX',
 'HASTAG_REGEX',
 'MASTERCard_REGEX',
 'MD5_SHA_REGEX',
 'MOST_COMMON_PUNCT_REGEX',
 'NUMBERS_REGEX',
 'PHONE_REGEX',
 'PoBOX_REGEX',
 'SPECIAL_CHARACTERS_REGEX',
 'STOPWORDS',
 'STOPWORDS_de',
 'STOPWORDS_en',
 'STOPWORDS_es',
 'STOPWORDS_fr',
 'STOPWORDS_ru',
 'STOPWORDS_yo',
 'STREET_ADDRESS_REGEX',
 'TextFrame',
 'URL_PATTERN',
 'USER_HANDLES_REGEX',
 'VISACard_REGEX',
 '__builtins__',
 '__cached__',
 '__doc__',
 '__file__',
 '__generate_text',
 '__loader__',
 '__name__',
 '__numbers_dict',
 '__package__',
 '__spec__',
 '_lex_richness_herdan',
 '_lex_richness_maas_ttr',
 'clean_text',
 'defaultdict',
 'digit2words',
 'extract_btc_address',
 'extract_currencies',
 'extract_currency_symbols',
 'extract_dates',
 'extract_emails',
 'extract_emojis',
 'extract_hashtags',
 'extract_html_tags',
 'extract_mastercard_addr',
 'extract_md5sha',
 'extract_numbers',
 'extr

In [113]:
def preprocess(series):
    """Apply the NeatText cleaning steps, in order, to a pandas Series of strings."""
    series = series.apply(nfx.remove_urls)
    series = series.apply(nfx.remove_special_characters)
    series = series.apply(nfx.remove_emojis)
    series = series.apply(nfx.remove_numbers)
    series = series.apply(nfx.remove_puncts)
    series = series.apply(nfx.remove_hashtags)
    series = series.apply(nfx.remove_currencies)
    series = series.apply(lambda x: x.lower())
    series = series.apply(nfx.remove_multiple_spaces)
    series = series.apply(nfx.remove_stopwords)

    return series
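
For readers without NeatText installed, the general idea of the pipeline can be approximated with the standard `re` module. The sketch below is a simplified stand-in, not NeatText's implementation, and it does not reproduce the notebook's output exactly; `basic_clean` and the patterns are assumptions:

```python
import re
import pandas as pd

URL_RE = re.compile(r"https?://\S+|www\.\S+")
NON_LETTER_RE = re.compile(r"[^a-z\s]")
SPACE_RE = re.compile(r"\s+")

def basic_clean(text: str) -> str:
    """Lowercase, drop URLs, keep only ASCII letters and spaces, collapse whitespace."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = NON_LETTER_RE.sub(" ", text)
    return SPACE_RE.sub(" ", text).strip()

toy = pd.Series(["Read https://example.com NOW!!! 100% true :)", "single-use plastics"])
cleaned = toy.apply(basic_clean).tolist()
print(cleaned)
```

One design difference: this sketch replaces removed characters with a space, so "single-use" becomes "single use", whereas the notebook's output shows "singleuse".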



In [115]:
df2["final"] = preprocess(df2['stopwords-removed'])
df2["final"]

0        alternative economic model needed native ameri...
1        splutters tea dude owns duchy cornwall sure he...
2        garbagennhalf necessarily correct spell hair c...
3        hello thank comment ni dedicated learn strive ...
4        sounds great wood wooden buildings look nice d...
                               ...                        
22461                  descended aliens extra terrestrials
22462    aliens mentioned article saw archeological sto...
22463    called great american interchange major upset ...
22464    good luck maybe post subs results biased cours...
22465    thank yes tried design survey separate green o...
Name: final, Length: 21027, dtype: object

In [117]:
comments = pd.DataFrame(df2, columns = ["id", "final"])
comments.head(3)

Unnamed: 0,id,final
0,ffbxc48,alternative economic model needed native ameri...
1,ffc5i6h,splutters tea dude owns duchy cornwall sure he...
2,ff8z7yf,garbagennhalf necessarily correct spell hair c...


# Combining Green & Sustainability submissions, 2015-2020

In [118]:
a = pd.read_csv("Green-submission-before2020.csv")
b = pd.read_csv("sustainability-submission-before2020.csv")

**Reddit submissions contain two text fields: title and selftext. The pandas profile shows that only the selftext column contains bot messages, "[removed]" and "[deleted]" values, so I first remove them from that column.**

In [120]:
merged2 = pd.concat([a,b],ignore_index=True)
merged2.head()

Unnamed: 0,id,created_utc,title,selftext,full_link
0,esw9zn,1579800339,AU Brushfire Relief Charity Bundles - Drive Th...,,https://www.reddit.com/r/Green/comments/esw9zn...
1,ess8fe,1579781539,Prince Charles: We need a new economic model o...,,https://www.reddit.com/r/Green/comments/ess8fe...
2,esloit,1579742327,China is moving to phase out single-use plasti...,,https://www.reddit.com/r/Green/comments/esloit...
3,esfx97,1579717268,Microsoft Plans to be Carbon Negative by 2030,,https://www.reddit.com/r/Green/comments/esfx97...
4,escvi1,1579704085,It's not looking good for Big Oil :),,https://www.reddit.com/r/Green/comments/escvi1...


In [121]:
print('Training Set Shape = {}'.format(merged2.shape))
print('Training Set Memory Usage = {:.2f} MB'.format(merged2.memory_usage().sum() / 1024**2))

Training Set Shape = (10773, 5)
Training Set Memory Usage = 0.41 MB


### Pandas Profiling Package for text statistics

In [122]:
from pandas_profiling import ProfileReport

In [123]:
profile4 = ProfileReport(merged2)
profile4




In [124]:
df3 = pd.DataFrame(merged2, columns = ["id", "title", "selftext"])
df3.head(3)

Unnamed: 0,id,title,selftext
0,esw9zn,AU Brushfire Relief Charity Bundles - Drive Th...,
1,ess8fe,Prince Charles: We need a new economic model o...,
2,esloit,China is moving to phase out single-use plasti...,


In [125]:
duplicate2 = df3[df3["selftext"].duplicated()]
print(len(duplicate2))
duplicate2.head()

9395


Unnamed: 0,id,title,selftext
1,ess8fe,Prince Charles: We need a new economic model o...,
2,esloit,China is moving to phase out single-use plasti...,
3,esfx97,Microsoft Plans to be Carbon Negative by 2030,
4,escvi1,It's not looking good for Big Oil :),
6,esbism,"Greenhouse gas 12,000 times more potent than c...",


In [126]:
# Rows whose selftext is the "[deleted]" placeholder
deleted_rows = df3[df3["selftext"]=="[deleted]"].dropna()
print(deleted_rows.shape)
deleted_rows.head()

(300, 3)


Unnamed: 0,id,title,selftext
699,d7ny9j,Google announces its biggest renewable energy ...,[deleted]
1452,bayowt,This guy.,[deleted]
1555,av5ecv,hm,[deleted]
1739,adlivg,Single use wooden spoons came into circulation...,[deleted]
2470,8ehbln,How Windmills as Wide as Jumbo Jets Are Making...,[deleted]


In [127]:
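import numpy as np
# Series.replace with a plain string only replaces cells that equal it exactly
df3['selftext'] = df3['selftext'].replace("[deleted]", "")
df3['selftext'] = df3['selftext'].replace("[removed]", "")
df3['selftext'] = df3['selftext'].replace("<not selftext available>", "")
# Missing selftext becomes a single space so that title + " " + selftext does not yield NaN
df3['selftext'] = df3['selftext'].replace(np.nan," ")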

In [128]:
profile4 = ProfileReport(df3)
profile4




**Now the two text columns are combined into one for further analysis.**

In [129]:
df3["content"] = df3["title"] +" "+ df3["selftext"]
df3.head()

Unnamed: 0,id,title,selftext,content
0,esw9zn,AU Brushfire Relief Charity Bundles - Drive Th...,,AU Brushfire Relief Charity Bundles - Drive Th...
1,ess8fe,Prince Charles: We need a new economic model o...,,Prince Charles: We need a new economic model o...
2,esloit,China is moving to phase out single-use plasti...,,China is moving to phase out single-use plasti...
3,esfx97,Microsoft Plans to be Carbon Negative by 2030,,Microsoft Plans to be Carbon Negative by 2030
4,escvi1,It's not looking good for Big Oil :),,It's not looking good for Big Oil :)
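
Concatenating the title with an empty or blank selftext leaves trailing spaces in `content`. A minimal sketch of one way to avoid that, on invented rows (`PLACEHOLDERS` and `toy` are assumptions):

```python
import pandas as pd

PLACEHOLDERS = ["[deleted]", "[removed]"]

toy = pd.DataFrame({
    "title": ["Microsoft Plans to be Carbon Negative by 2030", "This guy.", "Composting tips"],
    "selftext": [None, "[deleted]", "Start with a small bin."],
})

# Treat placeholders and missing values alike: both become empty strings.
selftext = toy["selftext"].replace(PLACEHOLDERS, "").fillna("")

# Join and strip so that rows without selftext do not end in a trailing space.
toy["content"] = (toy["title"] + " " + selftext).str.strip()
print(toy["content"].tolist())
```

The trailing whitespace is harmless here because later steps collapse spaces, but stripping early keeps intermediate outputs easier to compare.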


In [133]:
df3["title"][0]

'AU Brushfire Relief Charity Bundles - Drive Thru RPG Fights Fires With Gaming (cross post from  environment)'

In [131]:
df3 = df3.replace(r'/r/',' ', regex=True)

In [134]:
# EXAMPLE
df3["content"][0]

'AU Brushfire Relief Charity Bundles - Drive Thru RPG Fights Fires With Gaming (cross post from  environment)  '

In [135]:
df4 = pd.DataFrame(df3, columns = ["id", "content"])
df4.head(3)

Unnamed: 0,id,content
0,esw9zn,AU Brushfire Relief Charity Bundles - Drive Th...
1,ess8fe,Prince Charles: We need a new economic model o...
2,esloit,China is moving to phase out single-use plasti...


In [136]:
profile5 = ProfileReport(df4)
profile5




### Expand Slang Abbreviations (Custom Function)

In [137]:
def slang(x):
    if type(x) is str:
        for key in slang_dict:
            value = slang_dict[key]
            x = x.replace(key,value)
        return x
    else:
        return x

In [138]:
# EXAMPLE
slang("Hey Rishabh BRB !!")

'Hey Rishabh Be Right Back !!'

In [139]:
df4["SLANG-removed"] = df4["content"].apply(slang)

### Expand Contractions

In [140]:
# EXAMPLE
cont_to_exp("What's the best way to ensure this?")

'what has / what is the best way to ensure this?'

In [141]:
df4["Cont-Exp"] = df4["SLANG-removed"].apply(cont_to_exp)

### Text Cleaning (NeatText)

In [142]:
# NLTK is used only for stopword removal; the rest of the cleaning uses the NeatText library.
from nltk.corpus import stopwords
stop = stopwords.words('english')

In [143]:
df4['stopwords-removed'] = df4["Cont-Exp"].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop)]))

In [144]:
df4.head()

Unnamed: 0,id,content,SLANG-removed,Cont-Exp,stopwords-removed
0,esw9zn,AU Brushfire Relief Charity Bundles - Drive Th...,AYou Brushfire Relief Charity Bundles - Drive ...,ayou brushfire relief charity bundles - drive ...,ayou brushfire relief charity bundles - drive ...
1,ess8fe,Prince Charles: We need a new economic model o...,Prince Charles: We need a new economic model o...,prince charles: we need a new economic model o...,prince charles: need new economic model planet...
2,esloit,China is moving to phase out single-use plasti...,China is moving to phase out single-use plasti...,china is moving to phase out single-use plasti...,china moving phase single-use plastics startin...
3,esfx97,Microsoft Plans to be Carbon Negative by 2030,Microsoft Plans to be Carbon Negative by 2030,microsoft plans to be carbon negative by 2030,microsoft plans carbon negative 2030
4,escvi1,It's not looking good for Big Oil :),It's not looking good for Big Oil :),it has / it is not looking good for big oil :),/ looking good big oil :)


In [145]:
df4["final"] = preprocess(df4['stopwords-removed'])
df4["final"]

0        ayou brushfire relief charity bundles drive rp...
1        prince charles need new economic model planet ...
2        china moving phase singleuse plastics starting...
3                          microsoft plans carbon negative
4                                     looking good big oil
                               ...                        
10768    mia released new video support hampms world re...
10769         marketing sex age report march swd greenstay
10770                tasmania unplugged mainland australia
10771                          yousgbc integrative process
10772    conducting short environmental ethical apps su...
Name: final, Length: 10773, dtype: object

In [146]:
submissions = pd.DataFrame(df4, columns = ["id", "final"])
submissions.head(3)

Unnamed: 0,id,final
0,esw9zn,ayou brushfire relief charity bundles drive rp...
1,ess8fe,prince charles need new economic model planet ...
2,esloit,china moving phase singleuse plastics starting...


# Final Merge of Submissions and Comments

In [147]:
finalmerged = pd.concat([submissions,comments],ignore_index=True)
finalmerged.head()

Unnamed: 0,id,final
0,esw9zn,ayou brushfire relief charity bundles drive rp...
1,ess8fe,prince charles need new economic model planet ...
2,esloit,china moving phase singleuse plastics starting...
3,esfx97,microsoft plans carbon negative
4,escvi1,looking good big oil


In [148]:
finalmerged.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31800 entries, 0 to 31799
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   id      31800 non-null  object
 1   final   31800 non-null  object
dtypes: object(2)
memory usage: 497.0+ KB


In [149]:
profile6 = ProfileReport(finalmerged)
profile6




In [150]:
# Most frequent value in the final column
finalmerged["final"].value_counts().idxmax()

''

In [151]:
finalmerged = finalmerged.loc[finalmerged["final"] != '']
finalmerged.shape

(31539, 2)

In [152]:
finalmerged = finalmerged.loc[finalmerged["final"] != 'deleted']
finalmerged.shape

(31514, 2)

In [153]:
finalmerged = finalmerged.loc[finalmerged["final"] != 'thank']
finalmerged.shape

(31491, 2)

In [154]:
finalmerged = finalmerged.loc[finalmerged["final"] != 'thanks']
finalmerged.shape

(31474, 2)

In [155]:
finalmerged = finalmerged.loc[finalmerged["final"] != 'yes']
finalmerged.shape

(31463, 2)

In [156]:
# Export to CSV
finalmerged.to_csv( "finalmerged.csv", index=False, encoding='utf-8-sig')
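
A quick way to check an export is to read the file back. The sketch below uses a temporary file and an invented two-row frame. It also illustrates one reason for removing empty `final` values before exporting: with default settings, `read_csv` reads an empty field back as NaN.

```python
import os
import tempfile

import pandas as pd

toy = pd.DataFrame({
    "id": ["s1", "c1"],
    "final": ["prince charles need new economic model", ""],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.csv")
    # Same options as the export above, applied to the toy frame.
    toy.to_csv(path, index=False, encoding="utf-8-sig")
    roundtrip = pd.read_csv(path, encoding="utf-8-sig")

# The empty string in the second row comes back as a missing value.
print(roundtrip["final"].isna().tolist())
```

Passing `keep_default_na=False` to `read_csv` is one way to keep empty strings as they are, if that is ever needed.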