## Clean Reddit Posts
In this notebook I clean the Reddit posts. Text cleaning is an essential step in machine learning and deep learning pipelines. The following steps are applied to clean the text:
* Remove URLs.
* Remove HTML tags, if any, in the text.
* Remove accented characters.
* Expand the contractions.
* Remove special characters (and, with the default settings, digits).
* Remove common greeting words such as Hi, Hey, and Hello.  
I am not removing stopwords here, as they will be handled in the individual notebooks.
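
The steps above can be sketched on a single string using only the standard library. This is a simplified illustration, not the code used later in this notebook: the regex tag stripper stands in for BeautifulSoup, and `CONTRACTIONS` is a hypothetical, deliberately tiny map standing in for the `contractions` package. The name `clean_sketch` is likewise made up for this example.

```python
import html
import re
import unicodedata

# Hypothetical, tiny contraction map; the notebook itself uses the
# `contractions` package for this step.
CONTRACTIONS = {"can't": "cannot", "don't": "do not", "it's": "it is"}
GREETINGS = {"hi", "hey", "hello"}


def clean_sketch(text):
    # 1. Remove HTML tags (regex stand-in for BeautifulSoup) and unescape entities.
    text = html.unescape(re.sub(r"<[^>]+>", " ", text))
    # 2. Remove accents: decompose characters, then drop the non-ASCII marks.
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    # 3. Expand contractions found in the small map; leave other words untouched.
    text = re.sub(
        r"[A-Za-z]+'[A-Za-z]+",
        lambda m: CONTRACTIONS.get(m.group(0).lower(), m.group(0)),
        text,
    )
    # 4. Remove special characters (digits are kept in this sketch).
    text = re.sub(r"[^a-zA-Z0-9\s]", "", text)
    # 5. Remove greeting words and normalize whitespace.
    return " ".join(t for t in text.split() if t.lower() not in GREETINGS)


print(clean_sketch("<p>Hi, I can't open the café menu!</p>"))
# I cannot open the cafe menu
```

Note that the order matters: contractions have to be expanded before special characters are removed, otherwise the apostrophe is gone and `can't` turns into `cant`.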


In [0]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [0]:
!pip install textsearch
!pip install contractions

Collecting textsearch
  Downloading https://files.pythonhosted.org/packages/42/a8/03407021f9555043de5492a2bd7a35c56cc03c2510092b5ec018cae1bbf1/textsearch-0.0.17-py2.py3-none-any.whl
Collecting pyahocorasick (from textsearch)
  Downloading https://files.pythonhosted.org/packages/f4/9f/f0d8e8850e12829eea2e778f1c90e3c53a9a799b7f412082a5d21cd19ae1/pyahocorasick-1.4.0.tar.gz (312kB)
Collecting Unidecode (from textsearch)
  Downloading https://files.pythonhosted.org/packages/d0/42/d9edfed04228bacea2d824904cae367ee9efd05e6cce7ceaaedd0b0ad964/Unidecode-1.1.1-py2.py3-none-any.whl (238kB)
Building wheels for collected packages: pyahocorasick
  Building wheel for pyahocorasick (setup.py) ... done
  Created wheel for pyahocorasick: filename=pyahocorasick-1.4.0-cp36-cp36m-linux_x86_64.whl size=81702 sha256=015d778793164eaf7bdbf2806ecf64b4e42b043b3b

In [0]:
import numpy as np
import pandas as pd
import nltk
import re
from bs4 import BeautifulSoup
import unicodedata
import contractions
import spacy
import tqdm
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [0]:
df = pd.read_csv('/content/drive/My Drive/Colab Notebooks/reddit_posts.csv')
df.head()

Unnamed: 0,Title,Body,SubReddit
0,What things about React annoy you the most?,"Can be anything: missing features, boilerplate...",reactjs
1,Tutorial: Building a contacts manager using Vu...,,vuejs
2,Having trouble deciding what design pattern sh...,The structure of my project is the following:\...,vuejs
3,Prettier rule for this?,I use Prettier and the auto format on save opt...,reactjs
4,Conditional Rendering in Vue JS - Beginner Tut...,,vuejs


In [0]:
df['Title'] = df['Title'].fillna('missing')
df['Body'] = df['Body'].fillna('missing')
df.head()

Unnamed: 0,Title,Body,SubReddit
0,What things about React annoy you the most?,"Can be anything: missing features, boilerplate...",reactjs
1,Tutorial: Building a contacts manager using Vu...,missing,vuejs
2,Having trouble deciding what design pattern sh...,The structure of my project is the following:\...,vuejs
3,Prettier rule for this?,I use Prettier and the auto format on save opt...,reactjs
4,Conditional Rendering in Vue JS - Beginner Tut...,missing,vuejs


In [0]:
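# Load spaCy's English model with the parser, tagger, and NER disabled.
# This uses the spaCy 2.x `disable` syntax; the 'en' shortcut and the
# '-PRON-' lemma checked below are also spaCy 2.x conventions.
nlp = spacy.load('en', disable=['parser', 'tagger', 'ner'])
ps = nltk.porter.PorterStemmer()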


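def strip_html_tags(text):
    soup = BeautifulSoup(text, "html.parser")
    # drop embedded frames and scripts before extracting the visible text
    for tag in soup(['iframe', 'script']):
        tag.extract()
    stripped_text = soup.get_text()
    # collapse runs of carriage returns and line feeds into a single newline
    stripped_text = re.sub(r'[\r\n]+', '\n', stripped_text)
    return stripped_text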


def remove_accented_chars(text):
    text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8', 'ignore')
    return text


def expand_contractions(text):
    return contractions.fix(text)


def spacy_lemmatize_text(text):
    text = nlp(text)
    text = ' '.join([word.lemma_ if word.lemma_ != '-PRON-' else word.text for word in text])
    return text


def simple_stemming(text, stemmer=ps):
    text = ' '.join([stemmer.stem(word) for word in text.split()])
    return text

def remove_special_characters(text, remove_digits=False):
    pattern = r'[^a-zA-Z0-9\s]' if not remove_digits else r'[^a-zA-Z\s]'
    text = re.sub(pattern, '', text)
    return text


def remove_stopwords(text, is_lower_case=False, stopwords=None):
    if not stopwords:
        stopwords = nltk.corpus.stopwords.words('english')
    tokens = nltk.word_tokenize(text)
    tokens = [token.strip() for token in tokens]
    
    if is_lower_case:
        filtered_tokens = [token for token in tokens if token not in stopwords]
    else:
        filtered_tokens = [token for token in tokens if token.lower() not in stopwords]
    
    filtered_text = ' '.join(filtered_tokens) 
    return filtered_text

def remove_common_word(text):
    tokens = nltk.word_tokenize(text)
    tokens = [token.strip() for token in tokens]
    common_words = ['hey', 'hello', 'hi']
    filtered_tokens = [token for token in tokens if token.lower() not in common_words]
    filtered_text = ' '.join(filtered_tokens) 
    return filtered_text

def text_pre_processor(text, html_strip=True, accented_char_removal=True, contraction_expansion=True,
                       text_lower_case=False, special_char_removal=True, remove_digits=True, stopword_removal=False, 
                       stopword_list=None, text_stemming=False, text_lemmatization=False, remove_common_words=True):
    
    #remove urls
    text = re.sub(r'http\S+', '', text)
    
    # strip HTML
    if html_strip:
        text = strip_html_tags(text)
    
    # replace newlines, tabs, and carriage returns with spaces (common in noisy text)
    text = text.translate(str.maketrans("\n\t\r", "   "))
    
    
    # remove accented characters
    if accented_char_removal:
        text = remove_accented_chars(text)
    
    # expand contractions    
    if contraction_expansion:
        text = expand_contractions(text)
        
        
    # remove special characters and/or digits
    if special_char_removal:
        # pad punctuation with spaces so adjacent words are not fused when it is removed;
        # the hyphen is escaped so it is matched literally rather than forming a range
        special_char_pattern = re.compile(r'([{.(\-)!}])')
        text = special_char_pattern.sub(" \\1 ", text)
        text = remove_special_characters(text, remove_digits=remove_digits)
        
         
    # lowercase the text    
    if text_lower_case:
        text = text.lower()
        
        
    # remove stopwords
    if stopword_removal:
        text = remove_stopwords(text, is_lower_case=text_lower_case, 
                                stopwords=stopword_list)
    # remove common greeting words
    if remove_common_words:
        text = remove_common_word(text)
        
    # remove extra whitespace
    text = re.sub(' +', ' ', text)
    text = text.strip()
    
    return text
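
A note on `remove_accented_chars`: the NFKD trick only keeps a base letter when Unicode defines a decomposition for the character. The standalone sketch below repeats the helper so it can be run on its own; the sample strings are made up for illustration.

```python
import unicodedata


def remove_accented_chars(text):
    # Decompose each character, then drop everything that is not plain ASCII.
    return unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8', 'ignore')


# 'é' and 'ï' decompose into a base letter plus a combining mark, so the base letter survives.
print(remove_accented_chars('café naïve'))  # cafe naive

# 'ø' has no decomposition, so the whole character is dropped.
print(remove_accented_chars('Søren'))  # Sren
```

If posts in languages with such letters matter for the downstream models, a transliteration library such as Unidecode (already pulled in above as a dependency of `textsearch`) may be a better fit; that is a suggestion, not something this notebook does.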

In [0]:
df['Clean_Title'] = df['Title'].apply(text_pre_processor)

In [0]:
print(df.loc[700,'Title'])
print(df.loc[700,'Clean_Title'])

Anyone know how to interpolate a string in VuePress markdown triple backlash code part?
Anyone know how to interpolate a string in VuePress markdown triple backlash code part


In [0]:
df['Clean_Body'] = df['Body'].apply(text_pre_processor)

In [0]:
print(df.loc[100,'Body'])
print('###################################')
print(df.loc[100,'Clean_Body'])

Simple, I hope, question. I have a small data lookup app I built for a client in Vue that I now need to secure a bit better then it was originally. I setup a simple login using AWS Amplify but when I build for prod my bundle with just the login in it is 3.5mb. is there any way to trim amplify down?
###################################
Simple I hope question I have a small data lookup app I built for a client in Vue that I now need to secure a bit better then it was originally I setup a simple login using AWS Amplify but when I build for prod my bundle with just the login in it is mb is there any way to trim amplify down
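
In the cleaned text above, `3.5mb` has become `mb`. This follows from the defaults of `text_pre_processor`: punctuation is padded with spaces and then removed, and `remove_digits=True` strips the digits as well. The sketch below reproduces just that part of the pipeline in isolation; `pad_and_strip` is a hypothetical helper written for this illustration, and its padding pattern is a simplified version of the one used in the notebook.

```python
import re


def remove_special_characters(text, remove_digits=False):
    # Same character classes as the notebook helper of the same name.
    pattern = r'[^a-zA-Z\s]' if remove_digits else r'[^a-zA-Z0-9\s]'
    return re.sub(pattern, '', text)


def pad_and_strip(text, remove_digits):
    # Pad punctuation with spaces, remove special characters, collapse whitespace.
    padded = re.sub(r'([.!(){}])', r' \1 ', text)
    cleaned = remove_special_characters(padded, remove_digits=remove_digits)
    return re.sub(' +', ' ', cleaned).strip()


print(pad_and_strip('is 3.5mb.', remove_digits=True))   # is mb
print(pad_and_strip('is 3.5mb.', remove_digits=False))  # is 3 5mb
```

If numeric values such as bundle sizes or version numbers carry signal for a downstream model, passing `remove_digits=False` keeps the digits, although a decimal like `3.5` is still split into `3 5` by the punctuation padding.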


In [0]:
df.to_csv('/content/drive/My Drive/Colab Notebooks/clean_reddit_posts.csv', index=False, header=True)