# Combining Green&Sustainability comments after 2020

In [1]:
import pandas as pd
a = pd.read_csv("Green-comments-after2020.csv")
b = pd.read_csv("sustainability-comments-after2020.csv")

In [2]:
merged = pd.concat([a,b],ignore_index=True)
merged.head()

Unnamed: 0,id,submission_id,body,created_utc,parent_id,permalink
0,ffidx2m,etp02h,Serious question here.... why isn't the CBC ar...,1579971556,t3_etp02h,/r/Green/comments/etp02h/greta_thunberg_brushe...
1,ffizwc8,etp02h,"Never heard of them, but I do note every post ...",1579986861,t1_ffidx2m,/r/Green/comments/etp02h/greta_thunberg_brushe...
2,ffjgg9k,etp02h,So a spammer then.,1579997722,t1_ffizwc8,/r/Green/comments/etp02h/greta_thunberg_brushe...
3,fhcvwgt,euu2k0,"From what I've heard about Teslas, they handle...",1581471961,t3_euu2k0,/r/Green/comments/euu2k0/busting_3_myths_about...
4,ffulzcx,ev66jd,The article notes the environmental devastatio...,1580244778,t3_ev66jd,/r/Green/comments/ev66jd/why_developing_enviro...
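
The `created_utc` column holds Unix timestamps in seconds. A minimal sketch, assuming a small hypothetical frame `sample` in place of `merged`, of converting them to datetimes to check that the rows fall in the expected "after 2020" range:

```python
import pandas as pd

# Hypothetical stand-in for `merged`; the timestamps are copied from the preview above.
sample = pd.DataFrame({"created_utc": [1579971556, 1579986861, 1581471961]})

# Interpret the integers as seconds since the Unix epoch (UTC).
sample["created"] = pd.to_datetime(sample["created_utc"], unit="s", utc=True)

# Earliest timestamp in the sample; it should not predate 2020-01-01.
earliest = sample["created"].min()
print(earliest)
```

Running the same conversion on the full frame would make it easy to spot rows that fall outside the intended period.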


In [3]:
print('Training Set Shape = {}'.format(merged.shape))
print('Training Set Memory Usage = {:.2f} MB'.format(merged.memory_usage().sum() / 1024**2))

Training Set Shape = (58057, 6)
Training Set Memory Usage = 2.66 MB
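
Note that `memory_usage()` without arguments counts only the references stored in object-dtype columns, not the strings they point to, so the figure above likely understates the real footprint of the text data. A small sketch on a hypothetical two-row frame (not the real data) comparing the shallow and deep measurements:

```python
import pandas as pd

# Hypothetical stand-in for `merged`, forced to object dtype like a text column read from CSV.
sample = pd.DataFrame(
    {"id": ["ffidx2m", "ffizwc8"], "body": ["So a spammer then.", "Never heard of them"]},
    dtype=object,
)

shallow = sample.memory_usage().sum()        # references only for object columns
deep = sample.memory_usage(deep=True).sum()  # also counts the Python string objects

print('Shallow = {} bytes, deep = {} bytes'.format(shallow, deep))
```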


### Pandas Profiling Package for text statistics

In [5]:
from pandas_profiling import ProfileReport

In [6]:
profile = ProfileReport(merged)
profile


In [40]:
# Copy the merged data and keep only the id and the text column ("body").

df1 = merged.copy()
df1 = pd.DataFrame(df1, columns = ["id", "body"])

### Removing duplicates

In [41]:
duplicate = df1[df1["body"].duplicated()]
print(len(duplicate))
duplicate.head()

2632


Unnamed: 0,id,body
48,fh7bobq,[deleted]
49,fh7c6oy,[deleted]
50,fh0nedo,[deleted]
55,fh7by6x,[deleted]
65,fh5mir5,"Sorry, your submission has been automatically ..."


**As shown above, 2,632 rows have a duplicated `body`. However, these are not just comments that happen to share the same text: they include "[deleted]" and "[removed]" placeholders and bot messages.**
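
One way to see what the duplicated texts actually are is to count them. A minimal sketch on a hypothetical miniature of `df1` (the rows are invented for illustration):

```python
import pandas as pd

# Hypothetical miniature of `df1`.
sample = pd.DataFrame({
    "id": ["a1", "a2", "a3", "a4", "a5", "a6"],
    "body": ["[deleted]", "[deleted]", "[removed]", "[removed]", "[deleted]", "A real comment"],
})

# keep=False marks every member of a duplicate group, not only the later repeats.
dupes = sample[sample["body"].duplicated(keep=False)]

# Count how often each repeated text occurs, most frequent first.
dupe_counts = dupes["body"].value_counts()
print(dupe_counts)
```

On the real data, the top entries of such a count should point to the placeholder and bot texts that need to be removed.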

### Removing unnecessary rows "[deleted]"

In [42]:
df1 = df1.loc[df1["body"] != "[deleted]"]
df1.shape

(57147, 2)

### Removing unnecessary rows "[removed]"

In [43]:
df1 = df1.loc[df1["body"] != "[removed]"]
df1.shape

(56299, 2)

In [44]:
profile1 = ProfileReport(df1)
profile1


**As the pandas profile shows, some of the most frequent texts look like normal comments but are actually automated messages posted by bots.**


### Removing bots

In [45]:
# the text that appears on top in pandas profile for BOT USERS
df1["body"].value_counts().idxmax()

'Sorry, your submission has been automatically removed. Account age too young, spam likely.\\n\\n*I am a bot, and this action was performed automatically. Please [contact the moderators of this subreddit](/message/compose/?to=/r/Green) if you have any questions or concerns.*'

In [46]:
df1 = df1.loc[df1["body"] != "Sorry, your submission has been automatically removed. Account age too young, spam likely.\\n\\n*I am a bot, and this action was performed automatically. Please [contact the moderators of this subreddit](/message/compose/?to=/r/Green) if you have any questions or concerns.*"]
df1.shape

(56196, 2)

In [47]:
df1 = df1.loc[df1["body"] != "Sorry, your submission has been automatically removed. Not enough comment karma, spam likely.\\n\\n*I am a bot, and this action was performed automatically. Please [contact the moderators of this subreddit](/message/compose/?to=/r/Green) if you have any questions or concerns.*"]
df1.shape

(56137, 2)

In [48]:
df1 = df1.loc[df1["body"] != "Thank you!"]
df1.shape

(56090, 2)

In [49]:
df1 = df1.loc[df1["body"] != 'Thanks!']
df1.shape

(56060, 2)

In [50]:
df1 = df1.loc[df1["body"] != 'No']
df1.shape

(56044, 2)

In [51]:
df1 = df1.loc[df1["body"] != 'No.']
df1.shape

(56030, 2)

In [52]:
df1 = df1.loc[df1["body"] != '/r/Green does not allow image submissions.\\n\\n*I am a bot, and this action was performed automatically. Please [contact the moderators of this subreddit](/message/compose/?to=/r/Green) if you have any questions or concerns.*']
df1.shape

(56016, 2)

In [53]:
df1 = df1.loc[df1["body"] != 'Thank you']
df1.shape

(56003, 2)

In [54]:
df1 = df1.loc[df1["body"] != 'Yes.']
df1.shape

(55991, 2)

In [55]:
df1 = df1.loc[df1["body"] != 'Yes']
df1.shape

(55979, 2)

In [56]:
df1 = df1.loc[df1["body"] != 'Thanks']
df1.shape

(55967, 2)

In [57]:
df1 = df1.loc[df1["body"] != 'This']
df1.shape

(55958, 2)

In [58]:
df1 = df1.loc[df1["body"] != 'Good bot']
df1.shape

(55949, 2)

In [31]:
profile3 = ProfileReport(df1, title="Pandas Profiling Report", explorative=True)
profile3


**So far, I have removed the rows that add no value: deleted and removed comments, plus frequently repeated low-value rows such as bot messages.
Reddit data marks such rows with placeholder text rather than NaN, so dropping NaN values alone is not sufficient. I also had to remove these types of rows:**

1- [removed]

2- [deleted]

3- Bot messages ("Sorry, your submission has been automatically removed...")
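
The row-by-row filters above could also be written as a single pass. This is only a sketch on a hypothetical miniature of `df1`. It assumes that every automated moderator message contains the phrase "I am a bot", which is broader than the exact-match comparisons used in the cells above and may catch more rows:

```python
import pandas as pd

# Hypothetical miniature of `df1`; the bot text is a shortened version of the one removed above.
sample = pd.DataFrame({
    "id": ["a1", "a2", "a3", "a4", "a5"],
    "body": [
        "[deleted]",
        "Thanks!",
        "Sorry, your submission has been automatically removed.\n\n"
        "*I am a bot, and this action was performed automatically.*",
        "Good bot",
        "Solar panels paid for themselves in six years.",
    ],
})

# Exact-match texts to drop (placeholders and low-value replies).
drop_exact = ["[deleted]", "[removed]", "Thank you!", "Thanks!", "Thank you", "Thanks",
              "No", "No.", "Yes", "Yes.", "This", "Good bot"]

# Assumption: automated messages all contain this phrase.
is_bot = sample["body"].str.contains("I am a bot", regex=False, na=False)
is_noise = sample["body"].isin(drop_exact)

cleaned = sample.loc[~(is_bot | is_noise)]
print(cleaned.shape)
```

Keeping the exact-match list in one place makes it easier to review what was dropped and to reuse the same filter on the submissions.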

In [65]:
df2 = df1.copy()

### Remove Slang Words (Custom Function)

In [32]:
slang_dict = {
"AFAIK": "As Far As I Know",
"AFK": "Away From Keyboard",
"ASAP": "As Soon As Possible",
"ATK": "At The Keyboard",
"ATM": "At The Moment",
"A3": "Anytime, Anywhere, Anyplace",
"BAK": "Back At Keyboard",
"BBL": "Be Back Later",
"BBS": "Be Back Soon",
"BFN": "Bye For Now",
"B4N": "Bye For Now",
"BRB": "Be Right Back",
"BRT": "Be Right There",
"BTW": "By The Way",
"B4": "Before",
"CU": "See You",
"CUL8R": "See You Later",
"CYA": "See You",
"FAQ": "Frequently Asked Questions",
"FC": "Fingers Crossed",
"FWIW": "For What It's Worth",
"FYI": "For Your Information",
"GAL": "Get A Life",
"GG": "Good Game",
"GN": "Good Night",
"GMTA": "Great Minds Think Alike",
"GR8": "Great!",
"G9": "Genius",
"IC": "I See",
"ICQ": "I Seek you (also a chat program)",
"ILU": "I Love You",
"IMHO": "In My Honest / Humble Opinion",
"IMO": "In My Opinion",
"IOW": "In Other Words",
"IRL": "In Real Life",
"KISS": "Keep It Simple, Stupid",
"LDR": "Long Distance Relationship",
"LMAO": "Laugh My Ass Off",
"LOL": "Laughing Out Loud",
"LTNS": "Long Time No See",
"L8R": "Later",
"MTE": "My Thoughts Exactly",
"M8": "Mate",
"NRN": "No Reply Necessary",
"OIC": "Oh I See",
"PITA": "Pain In The Ass",
"PRT": "Party",
"PRW": "Parents Are Watching",
"ROFL": "Rolling On The Floor Laughing",
"ROFLOL": "Rolling On The Floor Laughing Out Loud",
"ROTFLMAO": "Rolling On The Floor Laughing My Ass Off",
"SK8": "Skate",
"STATS": "Your sex and age",
"ASL": "Age, Sex, Location",
"THX": "Thank You",
"TTFN": "Ta-Ta For Now!",
"TTYL": "Talk To You Later",
"U": "You",
"U2": "You Too",
"U4E": "Yours For Ever",
"WB": "Welcome Back",
"WTF": "What The Fuck",
"WTG": "Way To Go!",
"WUF": "Where Are You From?",
"W8": "Wait"
}


In [60]:
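def slang(x):
    # Expand every abbreviation in slang_dict; non-string values (e.g. NaN) pass through unchanged.
    if isinstance(x, str):
        for key, value in slang_dict.items():
            # Plain substring replacement: keys are matched anywhere, even inside longer words.
            x = x.replace(key, value)
    return x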

In [61]:
# EXAMPLE
slang("Hey Rishabh BRB !!")

'Hey Rishabh Be Right Back !!'
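
Because `slang` uses plain substring replacement, short keys such as `U` also match inside other tokens. The submissions preview later in this notebook shows `U.S.` turned into `You.S.`. A possible variant, sketched here with a small hypothetical subset of `slang_dict`, matches only standalone tokens by using regular-expression lookarounds. The names `slang_subset`, `slang_pattern`, and `expand_slang` are illustrative and not part of the notebook:

```python
import re

# Small hypothetical subset of `slang_dict`, for illustration only.
slang_subset = {"BRB": "Be Right Back", "U": "You", "IC": "I See"}

# Longest keys first so that a longer abbreviation wins over a shorter prefix.
_alternatives = "|".join(re.escape(k) for k in sorted(slang_subset, key=len, reverse=True))

# Match a key only when it is not glued to a word character, and not part of a
# dotted abbreviation such as "U.S." (a period directly followed by a letter).
slang_pattern = re.compile(r"(?<!\w)(?<!\w\.)(" + _alternatives + r")(?!\w)(?!\.\w)")

def expand_slang(text):
    # Non-string values (e.g. NaN) pass through unchanged.
    if not isinstance(text, str):
        return text
    return slang_pattern.sub(lambda m: slang_subset[m.group(1)], text)

print(expand_slang("Hey Rishabh BRB !!"))
print(expand_slang("mockery from U.S. finance chief"))
```

The trade-off is that abbreviations written inside other tokens are no longer expanded, which seems acceptable for this kind of cleaning.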

In [66]:
df2["SLANG-removed"] =df2["body"].apply(lambda x: slang(x))

In [67]:
df2.head()

Unnamed: 0,id,body,SLANG-removed
0,ffidx2m,Serious question here.... why isn't the CBC ar...,Serious question here.... why isn't the CBC ar...
1,ffizwc8,"Never heard of them, but I do note every post ...","Never heard of them, but I do note every post ..."
2,ffjgg9k,So a spammer then.,So a spammer then.
3,fhcvwgt,"From what I've heard about Teslas, they handle...","From what I've heard about Teslas, they handle..."
4,ffulzcx,The article notes the environmental devastatio...,The article notes the environmental devastatio...


### Expand Contractions

In [68]:
contractions_dict = {
"ain't": "am not / are not",
"aren't": "are not / am not",
"can't": "cannot",
"can't've": "cannot have",
"'cause": "because",
"could've": "could have",
"couldn't": "could not",
"couldn't've": "could not have",
"didn't": "did not",
"doesn't": "does not",
"don't": "do not",
"hadn't": "had not",
"hadn't've": "had not have",
"hasn't": "has not",
"haven't": "have not",
"he'd": "he had / he would",
"he'd've": "he would have",
"he'll": "he shall / he will",
"he'll've": "he shall have / he will have",
"he's": "he has / he is",
"how'd": "how did",
"how'd'y": "how do you",
"how'll": "how will",
"how's": "how has / how is",
"i'd": "I had / I would",
"i'd've": "I would have",
"i'll": "I shall / I will",
"i'll've": "I shall have / I will have",
"i'm": "I am",
"i've": "I have",
"isn't": "is not",
"it'd": "it had / it would",
"it'd've": "it would have",
"it'll": "it shall / it will",
"it'll've": "it shall have / it will have",
"it's": "it has / it is",
"let's": "let us",
"ma'am": "madam",
"mayn't": "may not",
"might've": "might have",
"mightn't": "might not",
"mightn't've": "might not have",
"must've": "must have",
"mustn't": "must not",
"mustn't've": "must not have",
"needn't": "need not",
"needn't've": "need not have",
"o'clock": "of the clock",
"oughtn't": "ought not",
"oughtn't've": "ought not have",
"shan't": "shall not",
"sha'n't": "shall not",
"shan't've": "shall not have",
"she'd": "she had / she would",
"she'd've": "she would have",
"she'll": "she shall / she will",
"she'll've": "she shall have / she will have",
"she's": "she has / she is",
"should've": "should have",
"shouldn't": "should not",
"shouldn't've": "should not have",
"so've": "so have",
"so's": "so as / so is",
"that'd": "that would / that had",
"that'd've": "that would have",
"that's": "that has / that is",
"there'd": "there had / there would",
"there'd've": "there would have",
"there's": "there has / there is",
"they'd": "they had / they would",
"they'd've": "they would have",
"they'll": "they shall / they will",
"they'll've": "they shall have / they will have",
"they're": "they are",
"they've": "they have",
"to've": "to have",
"wasn't": "was not",
"we'd": "we had / we would",
"we'd've": "we would have",
"we'll": "we will",
"we'll've": "we will have",
"we're": "we are",
"we've": "we have",
"weren't": "were not",
"what'll": "what shall / what will",
"what'll've": "what shall have / what will have",
"what're": "what are",
"what's": "what has / what is",
"what've": "what have",
"when's": "when has / when is",
"when've": "when have",
"where'd": "where did",
"where's": "where has / where is",
"where've": "where have",
"who'll": "who shall / who will",
"who'll've": "who shall have / who will have",
"who's": "who has / who is",
"who've": "who have",
"why's": "why has / why is",
"why've": "why have",
"will've": "will have",
"won't": "will not",
"won't've": "will not have",
"would've": "would have",
"wouldn't": "would not",
"wouldn't've": "would not have",
"y'all": "you all",
"y'all'd": "you all would",
"y'all'd've": "you all would have",
"y'all're": "you all are",
"y'all've": "you all have",
"you'd": "you had / you would",
"you'd've": "you would have",
"you'll": "you shall / you will",
"you'll've": "you shall have / you will have",
"you're": "you are",
"you've": "you have"
}

In [69]:
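def cont_to_exp(x):
    # Lowercase the text and expand every contraction in contractions_dict;
    # non-string values (e.g. NaN) pass through unchanged.
    if isinstance(x, str):
        x = x.lower()  # the dictionary keys are lowercase
        for key, value in contractions_dict.items():
            x = x.replace(key, value)
        x = x.lower()  # expansions such as "I am" contain capitals
    return x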

In [70]:
# EXAMPLE
cont_to_exp("What's the best way to ensure this?")

'what has / what is the best way to ensure this?'
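
Two details of this approach are worth noting. First, the keys are applied in dictionary order, so `can't` is replaced before `can't've` and the longer form ends up as `cannot've`. Second, expansions with two alternatives leave a `/` in the text. The sketch below is one possible alternative, not the method used in this notebook. It uses a hypothetical subset of the dictionary with a single expansion per key (which loses the "what has" reading), tries longer keys first, and maps the typographic apostrophe to the ASCII one before matching:

```python
import re

# Hypothetical subset of `contractions_dict`, with one expansion chosen per key.
contractions_subset = {
    "can't": "cannot",
    "can't've": "cannot have",
    "what's": "what is",
    "it's": "it is",
}

# Longest keys first, so "can't've" wins over its prefix "can't".
_keys = sorted(contractions_subset, key=len, reverse=True)
contraction_pattern = re.compile("|".join(re.escape(k) for k in _keys))

def expand_contractions(text):
    # Non-string values (e.g. NaN) pass through unchanged.
    if not isinstance(text, str):
        return text
    # Lowercase, then map the typographic apostrophe (U+2019) to the ASCII one.
    text = text.lower().replace("\u2019", "'")
    return contraction_pattern.sub(lambda m: contractions_subset[m.group(0)], text)

print(expand_contractions("What\u2019s the best way? I can't've missed it."))
# what is the best way? i cannot have missed it.
```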

In [71]:
df2["Cont-Exp"] =df2["SLANG-removed"].apply(lambda x: cont_to_exp(x))

In [72]:
df2.head()

Unnamed: 0,id,body,SLANG-removed,Cont-Exp
0,ffidx2m,Serious question here.... why isn't the CBC ar...,Serious question here.... why isn't the CBC ar...,serious question here.... why is not the cbc a...
1,ffizwc8,"Never heard of them, but I do note every post ...","Never heard of them, but I do note every post ...","never heard of them, but i do note every post ..."
2,ffjgg9k,So a spammer then.,So a spammer then.,so a spammer then.
3,fhcvwgt,"From what I've heard about Teslas, they handle...","From what I've heard about Teslas, they handle...","from what i have heard about teslas, they hand..."
4,ffulzcx,The article notes the environmental devastatio...,The article notes the environmental devastatio...,the article notes the environmental devastatio...


### Text Cleaning

In [73]:
# NLTK is used only for stopword removal; the rest of the cleaning uses the neattext library.
from nltk.corpus import stopwords
stop = set(stopwords.words('english'))  # a set makes membership tests fast

In [1]:
df2['stopwords-removed'] = df2["Cont-Exp"].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop)]))

In [77]:
import neattext.functions as nfx

In [78]:
# List the names that neattext.functions exposes for cleaning text data.
dir(nfx)

['BTC_ADDRESS_REGEX',
 'CURRENCY_REGEX',
 'CURRENCY_SYMB_REGEX',
 'Counter',
 'DATE_REGEX',
 'EMAIL_REGEX',
 'EMOJI_REGEX',
 'HASTAG_REGEX',
 'MASTERCard_REGEX',
 'MD5_SHA_REGEX',
 'MOST_COMMON_PUNCT_REGEX',
 'NUMBERS_REGEX',
 'PHONE_REGEX',
 'PoBOX_REGEX',
 'SPECIAL_CHARACTERS_REGEX',
 'STOPWORDS',
 'STOPWORDS_de',
 'STOPWORDS_en',
 'STOPWORDS_es',
 'STOPWORDS_fr',
 'STOPWORDS_ru',
 'STOPWORDS_yo',
 'STREET_ADDRESS_REGEX',
 'TextFrame',
 'URL_PATTERN',
 'USER_HANDLES_REGEX',
 'VISACard_REGEX',
 '__builtins__',
 '__cached__',
 '__doc__',
 '__file__',
 '__generate_text',
 '__loader__',
 '__name__',
 '__numbers_dict',
 '__package__',
 '__spec__',
 '_lex_richness_herdan',
 '_lex_richness_maas_ttr',
 'clean_text',
 'defaultdict',
 'digit2words',
 'extract_btc_address',
 'extract_currencies',
 'extract_currency_symbols',
 'extract_dates',
 'extract_emails',
 'extract_emojis',
 'extract_hashtags',
 'extract_html_tags',
 'extract_mastercard_addr',
 'extract_md5sha',
 'extract_numbers',
 'extr

In [75]:
def preprocess(series):
    # Apply the neattext cleaning steps, in order, to a pandas Series of strings.
    series = series.apply(nfx.remove_urls)
    series = series.apply(nfx.remove_special_characters)
    series = series.apply(nfx.remove_emojis)
    series = series.apply(nfx.remove_numbers)
    series = series.apply(nfx.remove_puncts)
    series = series.apply(nfx.remove_hashtags)
    series = series.apply(nfx.remove_currencies)
    series = series.apply(lambda x: x.lower())
    series = series.apply(nfx.remove_multiple_spaces)
    return series

In [79]:
df2["final"] = preprocess(df2['stopwords-removed'])
df2["final"]

0        serious question here cbc article posted hell ...
1        never heard them note every post umycojohn lin...
2                                             spammer then
3        heard teslas handle fast charging much better ...
4        article notes environmental devastation lithiu...
                               ...                        
58052    exactly actually saved significant amount stra...
58053    works ph paper nthis unnecessary plastic cover...
58054    sure accurate that put sample proper spot medi...
58055    ok fair thats case done surely reduction mater...
58056    link top cool know addition recycled plastic b...
Name: final, Length: 55949, dtype: object

In [81]:
df2["final"][6]

'check minute talk goto berlin jessica greene software engineer ecosia dropped full talk bio belownnecosia like google search engine however use profits plant trees world wide year celebrated users contributing million trees planted countries ghana colombia achieved working individual partner projects goal use majority profits plant trees engineering decisions need connect user directly purpose ecosia provides platform fits users needs allowing make sustainable choices bring awareness tech usennthis talk discuss ecosia develops features create connection users tree planting projects fund allowing feel personal sense achievement participation let us delve build tech solutions change worldnnwhat audience learn talk nthe audience learn work ecosia connect users real world impact use search engine everyday tech solutions used combating climate changenndoes feature code examples andor live coding npossibly small code examples slide live coding running code'
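
Tokens such as `belownnecosia` and `usennthis` in the output above suggest that the comment bodies contain the two-character sequence backslash + `n`, and that removing special characters strips the backslash and glues the leftover `n` to the neighbouring words. A possible fix, sketched under that assumption, is to replace both real line breaks and literal `\n` sequences with a space before any other cleaning step. The helper `normalize_newlines` is hypothetical and not part of neattext:

```python
import re

def normalize_newlines(text):
    """Replace real line breaks and literal backslash-n sequences with a space."""
    # Non-string values (e.g. NaN) pass through unchanged.
    if not isinstance(text, str):
        return text
    # r"\\n" matches a backslash followed by "n"; [\r\n]+ matches real line breaks.
    return re.sub(r"\\n|[\r\n]+", " ", text)

# The doubled backslashes create literal backslash-n sequences, as assumed for the scraped data.
raw = "dropped full talk bio below\\n\\nEcosia is like Google"
print(normalize_newlines(raw))
```

The double space this leaves behind would then be collapsed by the multiple-spaces step at the end of `preprocess`.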

In [82]:
comments = pd.DataFrame(df2, columns = ["id", "final"])
comments.head(3)

Unnamed: 0,id,final
0,ffidx2m,serious question here cbc article posted hell ...
1,ffizwc8,never heard them note every post umycojohn lin...
2,ffjgg9k,spammer then


# Combining Green&Sustainability submissions after 2020

In [83]:
a = pd.read_csv("Green-submission-after2020.csv")
b = pd.read_csv("sustainability-submission-after2020.csv")

**Reddit submissions have two text fields, `title` and `selftext`. I cleaned bot messages, "[removed]", and "[deleted]" only in the `selftext` column, because pandas profiling shows that only `selftext` contains these values.**

In [84]:
merged2 = pd.concat([a,b],ignore_index=True)
merged2.head()

Unnamed: 0,id,created_utc,title,selftext,full_link
0,etfi0m,1579895174,America’s New Drinking Problem? The Trump Admi...,\n\n[https://www.youtube.com/watch?v=K1eBxEwi...,https://www.reddit.com/r/Green/comments/etfi0m...
1,etouq6,1579945099,Trump’s gutting of environmental rules is push...,,https://www.reddit.com/r/Green/comments/etouq6...
2,etp02h,1579946285,Greta Thunberg brushes off mockery from U.S. f...,,https://www.reddit.com/r/Green/comments/etp02h...
3,etpv53,1579952926,"'It's a crisis, not a change': the six Guardia...",,https://www.reddit.com/r/Green/comments/etpv53...
4,etsaxl,1579967241,The Future will be Worse than We Thought,,https://www.reddit.com/r/Green/comments/etsaxl...


In [85]:
print('Training Set Shape = {}'.format(merged2.shape))
print('Training Set Memory Usage = {:.2f} MB'.format(merged2.memory_usage().sum() / 1024**2))

Training Set Shape = (12412, 5)
Training Set Memory Usage = 0.47 MB


### Pandas Profiling Package for text statistics

In [86]:
from pandas_profiling import ProfileReport

In [88]:
profile4 = ProfileReport(merged2)
profile4


In [107]:
df3 = pd.DataFrame(merged2, columns = ["id", "title", "selftext"])
df3.head(3)

Unnamed: 0,id,title,selftext
0,etfi0m,America’s New Drinking Problem? The Trump Admi...,\n\n[https://www.youtube.com/watch?v=K1eBxEwi...
1,etouq6,Trump’s gutting of environmental rules is push...,
2,etp02h,Greta Thunberg brushes off mockery from U.S. f...,


In [108]:
duplicate2 = df3[df3["selftext"].duplicated()]
print(len(duplicate2))
duplicate2.head()

9312


Unnamed: 0,id,title,selftext
2,etp02h,Greta Thunberg brushes off mockery from U.S. f...,
3,etpv53,"'It's a crisis, not a change': the six Guardia...",
4,etsaxl,The Future will be Worse than We Thought,
5,eue0zg,even enough sometimes you would like too...you...,
6,euf949,Why bamboo is a super plant which can help com...,


In [109]:
deleted_rows = df3[df3["selftext"]=="[deleted]"].dropna()
print(deleted_rows.shape)
deleted_rows.head()

(129, 3)


Unnamed: 0,id,title,selftext
529,go1qc1,green,[deleted]
605,h8i86m,Montenegrin URA Civic Movement Accepted into G...,[deleted]
697,ho8wlb,This overview of Monsanto's products shows the...,[deleted]
866,ikk1fs,Global Changes Seen from Above - Satellite Tim...,[deleted]
910,isile5,We stopped mowing most of the land and saw a b...,[deleted]


In [111]:
import numpy as np
df3['selftext'] = df3['selftext'].replace("[deleted]", "")
df3['selftext'] = df3['selftext'].replace("[removed]", "")
df3['selftext'] = df3['selftext'].replace("<not selftext available>", "")
df3['selftext'] = df3['selftext'].replace(np.nan," ")
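
The same placeholder cleanup can be expressed as one chained call. The sketch below uses a hypothetical miniature of `df3` and fills missing values with an empty string rather than the single space used above, which is a design choice and not something the notebook requires:

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of `df3`.
sample = pd.DataFrame({
    "title": ["Post A", "Post B", "Post C", "Post D"],
    "selftext": ["[deleted]", "[removed]", np.nan, "Some real text"],
})

# Exact-match placeholders mapped to an empty string; NaN is filled the same way.
placeholders = {"[deleted]": "", "[removed]": ""}
sample["selftext"] = sample["selftext"].replace(placeholders).fillna("")

print(sample["selftext"].tolist())
```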

In [96]:
profile4 = ProfileReport(df3)
profile4


**Now it is time to combine two columns into one to analyze further.**

In [112]:
df3["content"] = df3["title"] +" "+ df3["selftext"]
df3.head()

Unnamed: 0,id,title,selftext,content
0,etfi0m,America’s New Drinking Problem? The Trump Admi...,\n\n[https://www.youtube.com/watch?v=K1eBxEwi...,America’s New Drinking Problem? The Trump Admi...
1,etouq6,Trump’s gutting of environmental rules is push...,,Trump’s gutting of environmental rules is push...
2,etp02h,Greta Thunberg brushes off mockery from U.S. f...,,Greta Thunberg brushes off mockery from U.S. f...
3,etpv53,"'It's a crisis, not a change': the six Guardia...",,"'It's a crisis, not a change': the six Guardia..."
4,etsaxl,The Future will be Worse than We Thought,,The Future will be Worse than We Thought
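
String concatenation in pandas propagates missing values, which is why the NaN entries in `selftext` had to be replaced before this step. A small sketch on hypothetical rows showing the difference, plus an optional `strip()` to drop the trailing space that link-only posts would otherwise keep:

```python
import numpy as np
import pandas as pd

# Hypothetical rows: one post with body text, one link-only post.
sample = pd.DataFrame({
    "title": ["Post A", "Post B"],
    "selftext": ["Body text", np.nan],
})

# Plain concatenation: a missing selftext makes the whole result missing.
naive = sample["title"] + " " + sample["selftext"]

# Filling first keeps the title; strip() removes the trailing space.
combined = (sample["title"] + " " + sample["selftext"].fillna("")).str.strip()
print(combined.tolist())
```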


In [113]:
df3["title"][0]

'America’s New Drinking Problem? The Trump Administration Rolls Back the Clean Water Act'

In [114]:
df3["selftext"][0]

' \\n\\n[https://www.youtube.com/watch?v=K1eBxEwiqi4&amp;feature=youtu.be](https://www.youtube.com/watch?v=K1eBxEwiqi4&amp;feature=youtu.be)\\n\\nPresident Trump’s EPA just restricted the bodies of water that the federal government can regulate to: the territorial seas and traditional navigable waters, Perennial and intermittent tributaries to those waters, Certain lakes, ponds, and impoundments, and Wetlands adjacent to jurisdictional waters. This unfortunately means that the federal government is no longer enforcing laws over many other waters, including groundwater, wetlands that are not adjacent to large bodies of water, some seasonal streams that flow for only a portion of the year,  and “ephemeral” streams that only flow after rainstorms. Here’s what the means!'

In [115]:
# EXAMPLE
df3["content"][0]

'America’s New Drinking Problem? The Trump Administration Rolls Back the Clean Water Act  \\n\\n[https://www.youtube.com/watch?v=K1eBxEwiqi4&amp;feature=youtu.be](https://www.youtube.com/watch?v=K1eBxEwiqi4&amp;feature=youtu.be)\\n\\nPresident Trump’s EPA just restricted the bodies of water that the federal government can regulate to: the territorial seas and traditional navigable waters, Perennial and intermittent tributaries to those waters, Certain lakes, ponds, and impoundments, and Wetlands adjacent to jurisdictional waters. This unfortunately means that the federal government is no longer enforcing laws over many other waters, including groundwater, wetlands that are not adjacent to large bodies of water, some seasonal streams that flow for only a portion of the year,  and “ephemeral” streams that only flow after rainstorms. Here’s what the means!'

In [116]:
df4 = pd.DataFrame(df3, columns = ["id", "content"])
df4.head(3)

Unnamed: 0,id,content
0,etfi0m,America’s New Drinking Problem? The Trump Admi...
1,etouq6,Trump’s gutting of environmental rules is push...
2,etp02h,Greta Thunberg brushes off mockery from U.S. f...


In [117]:
profile5 = ProfileReport(df4)
profile5


### Remove Slang Words (Custom Function)

In [121]:
def slang(x):
    if type(x) is str:
        for key in slang_dict:
            value = slang_dict[key]
            x = x.replace(key,value)
        return x
    else:
        return x

In [120]:
# EXAMPLE
slang("Hey Rishabh BRB !!")

'Hey Rishabh Be Right Back !!'

In [122]:
df4["SLANG-removed"] =df4["content"].apply(lambda x: slang(x))

### Expand Contractions

In [123]:
# EXAMPLE
cont_to_exp("What's the best way to ensure this?")

'what has / what is the best way to ensure this?'

In [124]:
df4["Cont-Exp"] =df4["SLANG-removed"].apply(lambda x: cont_to_exp(x))

### Text Cleaning (Neat Text)

In [125]:
# NLTK is used only for stopword removal; the rest of the cleaning uses the neattext library.
from nltk.corpus import stopwords
stop = set(stopwords.words('english'))  # a set makes membership tests fast

In [127]:
df4['stopwords-removed'] = df4["Cont-Exp"].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop)]))

In [128]:
df4.head()

Unnamed: 0,id,content,SLANG-removed,Cont-Exp,stopwords-removed
0,etfi0m,America’s New Drinking Problem? The Trump Admi...,America’s New Drinking Problem? The Trump Admi...,america’s new drinking problem? the trump admi...,america’s new drinking problem? trump administ...
1,etouq6,Trump’s gutting of environmental rules is push...,Trump’s gutting of environmental rules is push...,trump’s gutting of environmental rules is push...,trump’s gutting environmental rules pushing us...
2,etp02h,Greta Thunberg brushes off mockery from U.S. f...,Greta Thunberg brushes off mockery from You.S....,greta thunberg brushes off mockery from you.s....,greta thunberg brushes mockery you.s. finance ...
3,etpv53,"'It's a crisis, not a change': the six Guardia...","'It's a crisis, not a change': the six Guardia...","'it has / it is a crisis, not a change': the s...","'it / crisis, change': six guardian language c..."
4,etsaxl,The Future will be Worse than We Thought,The Future will be Worse than We Thought,the future will be worse than we thought,future worse thought


In [129]:
df4["final"] = preprocess(df4['stopwords-removed'])
df4["final"]

0        americas new drinking problem trump administra...
1        trumps gutting environmental rules pushing us ...
2        greta thunberg brushes mockery yous finance ch...
3        it crisis change six guardian language changes...
4                                     future worse thought
                               ...                        
12407    majority top chemicals market deemed environme...
12408    debate topic prove manufacturing vaccines sust...
12409    going green year ago purchased gots cotton box...
12410             hey saved plastic plastic polution right
12411    googled plastic trash bag alternatives today i...
Name: final, Length: 12412, dtype: object

In [130]:
submissions = pd.DataFrame(df4, columns = ["id", "final"])
submissions.head(3)

Unnamed: 0,id,final
0,etfi0m,americas new drinking problem trump administra...
1,etouq6,trumps gutting environmental rules pushing us ...
2,etp02h,greta thunberg brushes mockery yous finance ch...


# Final Merge between Submissions and Comments

In [131]:
finalmerged = pd.concat([submissions,comments],ignore_index=True)
finalmerged.head()

Unnamed: 0,id,final
0,etfi0m,americas new drinking problem trump administra...
1,etouq6,trumps gutting environmental rules pushing us ...
2,etp02h,greta thunberg brushes mockery yous finance ch...
3,etpv53,it crisis change six guardian language changes...
4,etsaxl,future worse thought


In [132]:
finalmerged.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 68361 entries, 0 to 68360
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   id      68361 non-null  object
 1   final   68361 non-null  object
dtypes: object(2)
memory usage: 1.0+ MB


In [133]:
profile6 = ProfileReport(finalmerged)
profile6


In [134]:
# most frequent value in a pandas column
finalmerged["final"].value_counts().idxmax()

''

In [135]:
finalmerged = finalmerged.loc[finalmerged["final"] != '']
finalmerged.shape

(67985, 2)
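
The comparison with `''` removes rows that are exactly empty. If any cleaned text consists only of whitespace, it would survive that filter. I have not checked whether such rows exist here. A sketch on hypothetical rows of a slightly stricter filter:

```python
import pandas as pd

# Hypothetical rows: one real text, one empty string, one whitespace-only string.
sample = pd.DataFrame({"id": ["a1", "a2", "a3"], "final": ["future worse thought", "", "   "]})

# strip() turns whitespace-only strings into "", so both kinds of empty rows are caught.
non_empty = sample.loc[sample["final"].str.strip() != ""]
print(non_empty.shape)
```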

In [136]:
#export to csv
#finalmerged.to_csv( "finalmerged.csv", index=False, encoding='utf-8-sig')