# Developing a Text Preprocessing Pipeline

The pipeline takes raw text as input, cleans it, transforms it, and extracts basic features of the textual content.

## Steps in the text processing pipeline

**source text -> identify noise -> noise removal -> character normalization -> data masking ==> CLEAN TEXT**

**CLEAN TEXT -> tokenization -> POS tagging -> Lemmatization -> Named entity recognition ==> PREPARED TEXT**

1. Noise removal: identify and remove noise in the text, such as HTML tags and nonprintable characters.
2. Character normalization: transform special characters such as accents and hyphens into a standard representation.
3. Data masking: mask or remove identifiers such as URLs or email addresses if they are not relevant for the analysis or if they raise privacy issues.
4. Tokenization: split a document into a list of separate tokens such as words and punctuation characters.
5. Part-of-speech (POS) tagging: determine the word class of each token, for example noun, verb, or article.
6. Lemmatization: map inflected words to their uninflected root, the lemma.
7. Named-entity recognition: identify references to people, organizations, locations, etc. in the text.
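
Data masking (step 3) can be illustrated with two regular expressions. This is only a minimal sketch: the patterns `RE_URL` and `RE_EMAIL`, the placeholders `_URL_` and `_EMAIL_`, and the function name `mask_identifiers` are assumptions made for this example, and the patterns are deliberately simple (for instance, punctuation directly after a URL would be swallowed by the match).

```python
import re

# Deliberately simple, illustrative patterns (assumptions, not a complete spec)
RE_URL = re.compile(r'https?://\S+|www\.\S+')
RE_EMAIL = re.compile(r'[\w.+-]+@[\w-]+(?:\.[\w-]+)+')

def mask_identifiers(text):
    """Replace URLs and email addresses with placeholder tokens."""
    text = RE_URL.sub('_URL_', text)
    text = RE_EMAIL.sub('_EMAIL_', text)
    return text

sample = "Check out https://www.cars.com/articles or mail me at jane.doe@example.com today"
masked = mask_identifiers(sample)
print(masked)
```

Replacing identifiers with a placeholder instead of deleting them keeps the information that a URL or address was present, which may still be useful as a feature.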

In [1]:
# load the dataset

In [2]:
import pandas as pd

posts_df = pd.read_csv('C:/Users/0023ND744/Desktop/my_notebooks/text_analytics/rspct_autos.tsv.gz', sep='\t')

# subred_file = "subreddit_info.csv.gz"
# subred_file = f"{BASE_DIR}/data/reddit-selfposts/subreddit_info.csv.gz" ### real location
# subred_df = pd.read_csv(subred_file).set_index(['subreddit'])

# df = posts_df.join(subred_df, on='subreddit')
# len(df) ###

In [3]:
posts_df

Unnamed: 0,id,subreddit,title,selftext
0,8f73s7,Harley,No Club Colors,Funny story. I went to college in Las Vegas. T...
1,5s0q8r,Mustang,Roush vs Shleby GT500,"I am trying to determine which is faster, and ..."
2,5z3405,Volkswagen,2001 Golf Wagon looking for some insight,Hello! <lb><lb>Trying to find some information...
3,7df18v,Lexus,IS 250 Coolant Flush/Change,https://www.cars.com/articles/how-often-should...
4,5tpve8,volt,Gen1 mpg w/ dead battery?,"Hi, new to this subreddit. I'm considering bu..."
...,...,...,...,...
19995,7i2k6y,4Runner,Bilstein Shocks,I read a lot Forums and people recommend getti...
19996,83p2kv,Harley,Question on potential purchase of crashed bike.,I am thinking about buying a 2010 Harley Spor...
19997,7x722h,volt,Got our first warning light on our dash,My husband and I were headed somewhere and I w...
19998,7v2xmg,Lexus,Any IS models to avoid?,I am looking at getting a used Lexus IS (2014 ...


In [4]:
subred_df = pd.read_csv('C:/Users/0023ND744/Desktop/my_notebooks/text_analytics/subreddit_info.csv.gz').set_index(['subreddit'])

In [5]:
subred_df

subreddit,category_1,category_2,category_3,in_data,reason_for_exclusion
whatsthatbook,advice/question,book,,True,
CasualConversation,advice/question,broad,,False,too_broad
Clairvoyantreadings,advice/question,broad,,False,too_broad
DecidingToBeBetter,advice/question,broad,,False,too_broad
HelpMeFind,advice/question,broad,,False,too_broad
...,...,...,...,...,...
HFY,writing/stories,sci-fi,,True,
TalesFromYourServer,writing/stories,tech support,,False,fewer posts than r/talesfromtechsupport which ...
talesfromtechsupport,writing/stories,tech support,,True,
WayfarersPub,writing/stories,wayfarers pub,,True,


In [6]:
df = posts_df.join(subred_df, on='subreddit')

In [7]:
len(df)

20000

In [8]:
df

Unnamed: 0,id,subreddit,title,selftext,category_1,category_2,category_3,in_data,reason_for_exclusion
0,8f73s7,Harley,No Club Colors,Funny story. I went to college in Las Vegas. T...,autos,harley davidson,,True,
1,5s0q8r,Mustang,Roush vs Shleby GT500,"I am trying to determine which is faster, and ...",autos,ford,,True,
2,5z3405,Volkswagen,2001 Golf Wagon looking for some insight,Hello! <lb><lb>Trying to find some information...,autos,VW,,True,
3,7df18v,Lexus,IS 250 Coolant Flush/Change,https://www.cars.com/articles/how-often-should...,autos,lexus,,True,
4,5tpve8,volt,Gen1 mpg w/ dead battery?,"Hi, new to this subreddit. I'm considering bu...",autos,chevrolet,,True,
...,...,...,...,...,...,...,...,...,...
19995,7i2k6y,4Runner,Bilstein Shocks,I read a lot Forums and people recommend getti...,autos,toyota,,True,
19996,83p2kv,Harley,Question on potential purchase of crashed bike.,I am thinking about buying a 2010 Harley Spor...,autos,harley davidson,,True,
19997,7x722h,volt,Got our first warning light on our dash,My husband and I were headed somewhere and I w...,autos,chevrolet,,True,
19998,7v2xmg,Lexus,Any IS models to avoid?,I am looking at getting a used Lexus IS (2014 ...,autos,lexus,,True,


In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000 entries, 0 to 19999
Data columns (total 9 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   id                    20000 non-null  object
 1   subreddit             20000 non-null  object
 2   title                 20000 non-null  object
 3   selftext              20000 non-null  object
 4   category_1            20000 non-null  object
 5   category_2            20000 non-null  object
 6   category_3            0 non-null      object
 7   in_data               20000 non-null  bool  
 8   reason_for_exclusion  0 non-null      object
dtypes: bool(1), object(8)
memory usage: 1.2+ MB


In [10]:
df.columns

Index(['id', 'subreddit', 'title', 'selftext', 'category_1', 'category_2',
       'category_3', 'in_data', 'reason_for_exclusion'],
      dtype='object')

In [11]:
df.category_3.isnull().sum()

20000

In [12]:
# rename columns to make the data easier to understand

column_mapping = {
    'id': 'id',
    'subreddit': 'subreddit',
    'title': 'title',
    'selftext': 'text',
    'category_1': 'category',
    'category_2': 'subcategory',  
    'category_3': None, # no data
    'in_data': None, # not needed
    'reason_for_exclusion': None # not needed
}

# keep only the columns that are mapped to a new name
columns = [c for c, new_name in column_mapping.items() if new_name is not None]

# select and rename those columns
df = df[columns].rename(columns=column_mapping)

In [13]:
df

Unnamed: 0,id,subreddit,title,text,category,subcategory
0,8f73s7,Harley,No Club Colors,Funny story. I went to college in Las Vegas. T...,autos,harley davidson
1,5s0q8r,Mustang,Roush vs Shleby GT500,"I am trying to determine which is faster, and ...",autos,ford
2,5z3405,Volkswagen,2001 Golf Wagon looking for some insight,Hello! <lb><lb>Trying to find some information...,autos,VW
3,7df18v,Lexus,IS 250 Coolant Flush/Change,https://www.cars.com/articles/how-often-should...,autos,lexus
4,5tpve8,volt,Gen1 mpg w/ dead battery?,"Hi, new to this subreddit. I'm considering bu...",autos,chevrolet
...,...,...,...,...,...,...
19995,7i2k6y,4Runner,Bilstein Shocks,I read a lot Forums and people recommend getti...,autos,toyota
19996,83p2kv,Harley,Question on potential purchase of crashed bike.,I am thinking about buying a 2010 Harley Spor...,autos,harley davidson
19997,7x722h,volt,Got our first warning light on our dash,My husband and I were headed somewhere and I w...,autos,chevrolet
19998,7v2xmg,Lexus,Any IS models to avoid?,I am looking at getting a used Lexus IS (2014 ...,autos,lexus


In [14]:
# category_1 and category_2 are now 'category' and 'subcategory'; keep only rows in the 'autos' category
df = df[df['category'] == 'autos']

In [15]:
df

Unnamed: 0,id,subreddit,title,text,category,subcategory
0,8f73s7,Harley,No Club Colors,Funny story. I went to college in Las Vegas. T...,autos,harley davidson
1,5s0q8r,Mustang,Roush vs Shleby GT500,"I am trying to determine which is faster, and ...",autos,ford
2,5z3405,Volkswagen,2001 Golf Wagon looking for some insight,Hello! <lb><lb>Trying to find some information...,autos,VW
3,7df18v,Lexus,IS 250 Coolant Flush/Change,https://www.cars.com/articles/how-often-should...,autos,lexus
4,5tpve8,volt,Gen1 mpg w/ dead battery?,"Hi, new to this subreddit. I'm considering bu...",autos,chevrolet
...,...,...,...,...,...,...
19995,7i2k6y,4Runner,Bilstein Shocks,I read a lot Forums and people recommend getti...,autos,toyota
19996,83p2kv,Harley,Question on potential purchase of crashed bike.,I am thinking about buying a 2010 Harley Spor...,autos,harley davidson
19997,7x722h,volt,Got our first warning light on our dash,My husband and I were headed somewhere and I w...,autos,chevrolet
19998,7v2xmg,Lexus,Any IS models to avoid?,I am looking at getting a used Lexus IS (2014 ...,autos,lexus


In [16]:
df.sample(2)

Unnamed: 0,id,subreddit,title,text,category,subcategory
16279,8gdmyv,Hyundai,"Hyundai Matrix (2008, UK) - Unsure how to remo...","As per the title, really. I'd like to swap out...",autos,hyundai
17169,5ho005,Mustang,Frozen fuel line?,"I own a 2001 convertible with 180,000 miles. C...",autos,ford


In [17]:
# saving dataset

df.to_pickle("reddit_dataframe.pkl")

## Cleaning Text Data

### Step 1.a. : Removal of Noise

The following function helps identify noise in textual data. By noise we mean everything that is not plain text and may therefore disturb further analysis. The function uses a regular expression to search for a number of suspicious characters and returns their share of all characters as an impurity score. Very short texts (fewer than min_len characters) are ignored, because a single special character would produce a disproportionately high impurity score and distort the result.

In [18]:
text = """
After viewing the [PINKIEPOOL Trailer](https://www.youtu.be/watch?v=ieHRoHUg)
it got me thinking about the best match ups.
<lb>Here's my take:<lb><lb>[](/sp)[](/ppseesyou) Deadpool<lb>[](/sp)[](/ajsly)
Captain America<lb>"""

In [19]:

import re

RE_SUSPICIOUS = re.compile(r'[&#<>{}\[\]\\]')

def impurity(text, min_len=10):
    """returns the share of suspicious characters in a text"""
    if text is None or len(text) < min_len:
        return 0
    else:
        return len(RE_SUSPICIOUS.findall(text))/len(text)

print(impurity(text))

0.09009009009009009


In the example text above, about 9% of the characters are “suspicious”.

In [20]:
# compute the impurity of the text column to find the most “impure” records

df['impurity'] = df['text'].apply(impurity, min_len=10)

# get the top 10 records
df[['text', 'impurity']].sort_values(by='impurity', ascending=False).head(10)

Unnamed: 0,text,impurity
19682,Looking at buying a 335i with 39k miles and 11...,0.214716
12357,I'm looking to lease an a4 premium plus automa...,0.165099
2730,Breakdown below:<lb><lb>Elantra GT<lb><lb>2.0L...,0.13913
12754,Bulbs Needed:<lb><lb><lb>**194 LED BULB x8**<l...,0.132411
10726,I currently have a deposit on a 2013 335is (CP...,0.129317
11122,"Vehicle Price<tab><tab>$40,650.00<tab> <lb> <t...",0.119777
16895,TITLE IS WRONG IT'S A 2014<lb><lb>Interested i...,0.116208
2989,Looking into sport trucks and these are both i...,0.11157
1036,Subaru 1.5r aka Felicity<lb><lb>Fairly stock a...,0.111111
12303,I'm having a bit of difficulty deciding betwee...,0.108352


In [21]:
# check for tags such as line breaks (<lb>) and tabs (<tab>)
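
One possible way to run this check is to count tag-like tokens with a regular expression. The sketch below is an assumption about how the check could look, not part of the original notebook: `RE_TAG` and `count_tags` are hypothetical names, and the two sample strings are shortened snippets based on the record previews shown earlier.

```python
import re
from collections import Counter

# Matches simple tags such as <lb>, <tab> or </p>
RE_TAG = re.compile(r'<[\w/]+>')

def count_tags(texts):
    """Count how often each tag-like token occurs across an iterable of strings."""
    counter = Counter()
    for text in texts:
        counter.update(RE_TAG.findall(text))
    return counter

# Shortened snippets based on the record previews shown earlier
samples = [
    "Hello! <lb><lb>Trying to find some information",
    "Vehicle Price<tab><tab>$40,650.00<tab> <lb>",
]
tag_counts = count_tags(samples)
print(tag_counts.most_common())
```

On the full dataset, the same function could be called as `count_tags(df['text'])`, since a pandas Series iterates over its values.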



### Step 1.b. : Removing Noise with Regular Expressions

In [22]:
import html

def clean(text):
    # convert html escapes like &amp; to characters.
    text = html.unescape(text)
    # tags like <tab>
    text = re.sub(r'<[^<>]*>', ' ', text)
    # markdown URLs like [Some text](https://....)
    text = re.sub(r'\[([^\[\]]*)\]\([^\(\)]*\)', r'\1', text)
    # text or code in brackets like [0]
    text = re.sub(r'\[[^\[\]]*\]', ' ', text)
    # standalone sequences of specials, matches &# but not #cool
    text = re.sub(r'(?:^|\s)[&#<>{}\[\]+|\\:-]{1,}(?:\s|$)', ' ', text)
    # standalone sequences of hyphens like --- or ==
    text = re.sub(r'(?:^|\s)[\-=\+]{2,}(?:\s|$)', ' ', text)
    # sequences of white spaces
    text = re.sub(r'\s+', ' ', text)
    return text.strip()

In [23]:
# apply the function clean
clean_text = clean(text)
print(clean_text)
print("Impurity:", impurity(clean_text))

After viewing the PINKIEPOOL Trailer it got me thinking about the best match ups. Here's my take: Deadpool Captain America
Impurity: 0.0


In [24]:
# now for the complete dataset
df['clean_text'] = df['text'].map(clean)
df['impurity']   = df['clean_text'].apply(impurity, min_len=20)

In [25]:
df[['clean_text', 'impurity']].sort_values(by='impurity', ascending=False) \
                              .head(10)

Unnamed: 0,clean_text,impurity
14058,"Mustang 2018, 2019, or 2020? Must Haves!! 1. H...",0.030864
18934,"At the dealership, they offered an option for ...",0.026455
16505,"I am looking at four Caymans, all are in a sim...",0.024631
4376,"Hello, I came across this great looking e500 a...",0.022762
18729,The Mazda 3 hatchback I'm looking at is a 2007...,0.022099
15505,Not sure if this type of post is allowed but.....,0.021938
3255,I am in need of two front tires and instead of...,0.021792
11776,"Hi, I'm looking to buy an A45 AMG \(here in th...",0.021251
14935,"I heard these cars are money pits, but this on...",0.020927
19249,My car broke down this week driving up to Davi...,0.019274


In [26]:
# textacy leaves the linguistic part to spaCy and focuses on pre- and postprocessing

In [27]:
text = "The café “Saint-Raphaël” is loca-\nted on Côte dʼAzur."

In [28]:
import textacy
import textacy.preprocessing as tprep

# def normalize(text):
#     text = tprep.normalize_hyphenated_words(text)
#     text = tprep.normalize_quotation_marks(text)
#     text = tprep.normalize_unicode(text)
#     text = tprep.remove_accents(text)
#     return text

# the code above works for textacy versions < 0.11


# for textacy versions >= 0.11, use the code below

def normalize(text):
    text = tprep.normalize.hyphenated_words(text)
    text = tprep.normalize.quotation_marks(text)
    text = tprep.normalize.unicode(text)
    text = tprep.remove.accents(text)
    return text

In [29]:
print(normalize(text))

The cafe "Saint-Raphael" is located on Cote d'Azur.
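
To make the accent-removal step less of a black box, here is a standard-library sketch of the same idea: decompose each character into a base character plus combining marks, then drop the marks. The function name `strip_accents` is an assumption for this example, and the code is not taken from textacy. It covers accents only, not the hyphenation or quotation-mark handling shown above.

```python
import unicodedata

def strip_accents(text):
    """Decompose characters (NFKD) and drop the combining marks."""
    decomposed = unicodedata.normalize('NFKD', text)
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_accents("café Saint-Raphaël Côte"))
```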


## Tokenization

In [30]:
# sample text 

text = """
2019-08-10 23:32: @pete/@louis - I don't have a well-designed
solution for today's problem. The code of module AC68 should be -1.
Have to think a bit... #goodnight ;-) 😩😬"""

### a ) Tokenization with Regular Expressions

Two useful functions for tokenization are **re.split()** and **re.findall()**. The first splits a string at matching expressions, while the second extracts all character sequences matching a given pattern.
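
The difference between the two functions is easiest to see side by side. In this small sketch, the sample sentence is a fragment of the text defined above, and the variable names are chosen only for illustration.

```python
import re

sample = "Have to think a bit... #goodnight"

# re.split: the pattern describes the separators between tokens
split_tokens = re.split(r'\s+', sample)

# re.findall: the pattern describes the tokens themselves
found_tokens = re.findall(r'\w+', sample)

print(split_tokens)
print(found_tokens)
```

With re.split(), punctuation stays attached to the neighboring word; with re.findall(), everything that does not match the pattern is dropped.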

---

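scikit-learn's CountVectorizer uses the pattern (?u)\b\w\w+\b for its default tokenization. It selects all sequences of two or more word characters (letters, digits, and underscores). The shorter pattern \w\w+ finds the same tokens when used with re.findall().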

In [31]:
tokens = re.findall(r'\w\w+', text)
print(*tokens, sep='|')

2019|08|10|23|32|pete|louis|don|have|well|designed|solution|for|today|problem|The|code|of|module|AC68|should|be|Have|to|think|bit|goodnight


In [32]:
# the pattern above drops emojis and single-character tokens
# a more elaborate pattern retains them

RE_TOKEN = re.compile(r"""
               ( [#]?[@\w'’\.\-\:]*\w     # words, hashtags and email addresses
               | [:;<]\-?[\)\(3]          # coarse pattern for basic text emojis
               | [\U0001F100-\U0001FFFF]  # coarse code range for unicode emojis
               )
               """, re.VERBOSE)

def tokenize(text):
    return RE_TOKEN.findall(text)

tokens = tokenize(text)
print(*tokens, sep='|')

2019-08-10|23:32|@pete|@louis|I|don't|have|a|well-designed|solution|for|today's|problem|The|code|of|module|AC68|should|be|-1|Have|to|think|a|bit|#goodnight|;-)|😩|😬


### b) Tokenization with NLTK

In [33]:
# tokenize with NLTK's word_tokenize (requires NLTK's Punkt tokenizer data to be downloaded)
import nltk

tokens = nltk.tokenize.word_tokenize(text)
print(*tokens, sep='|')

2019-08-10|23:32|:|@|pete/|@|louis|-|I|do|n't|have|a|well-designed|solution|for|today|'s|problem|.|The|code|of|module|AC68|should|be|-1|.|Have|to|think|a|bit|...|#|goodnight|;|-|)|😩😬


## spaCy Text Processing

The flow is:

**Text -> tokenize -> POS tagging -> Parsing -> NER -> document**

The main object to represent the processed text is a Doc object, which itself contains a list of Token objects.

### Setting up a pipeline

As a first step, we need to instantiate an object of spaCy’s Language class by calling spacy.load() with the name of the model file to use.

In [35]:
import spacy  # spaCy version 3.1.3
nlp = spacy.load('en_core_web_sm')

In [36]:
# to check pipeline components
nlp.pipeline

[('tok2vec', <spacy.pipeline.tok2vec.Tok2Vec at 0x21e2b891680>),
 ('tagger', <spacy.pipeline.tagger.Tagger at 0x21e2b8a50e0>),
 ('parser', <spacy.pipeline.dep_parser.DependencyParser at 0x21e2b8b60a0>),
 ('attribute_ruler',
  <spacy.pipeline.attributeruler.AttributeRuler at 0x21e2b8d7780>),
 ('lemmatizer', <spacy.lang.en.lemmatizer.EnglishLemmatizer at 0x21e2b8bf180>),
 ('ner', <spacy.pipeline.ner.EntityRecognizer at 0x21e2b6b8fa0>)]

In [40]:
# to use only the tokenizer
nlp.make_doc(text)


2019-08-10 23:32: @pete/@louis - I don't have a well-designed
solution for today's problem. The code of module AC68 should be -1.
Have to think a bit... #goodnight ;-) 😩😬

In [41]:
doc = nlp(text)

In [42]:
doc


2019-08-10 23:32: @pete/@louis - I don't have a well-designed
solution for today's problem. The code of module AC68 should be -1.
Have to think a bit... #goodnight ;-) 😩😬

In [45]:
# a Doc is a container object for tokens,
# so we can iterate over it and print the tokens

for t in doc:
    print(t, end='*')


*2019*-*08*-*10*23:32*:*@pete/@louis*-*I*do*n't*have*a*well*-*designed*
*solution*for*today*'s*problem*.*The*code*of*module*AC68*should*be*-1*.*
*Have*to*think*a*bit*...*#*goodnight*;-)*😩*😬*

In [46]:
# each token is an instance of spaCy's Token class

# build a table of tokens with their attributes, such as lemma and POS

def display_nlp(doc, include_punct=False):
    """Generate data frame for visualization of spaCy tokens."""
    rows = []
    for i, t in enumerate(doc):
        if not t.is_punct or include_punct:
            row = {'token': i,  'text': t.text, 'lemma_': t.lemma_,
                   'is_stop': t.is_stop, 'is_alpha': t.is_alpha,
                   'pos_': t.pos_, 'dep_': t.dep_,
                   'ent_type_': t.ent_type_, 'ent_iob_': t.ent_iob_}
            rows.append(row)

    df = pd.DataFrame(rows).set_index('token')
    df.index.name = None
    return df

In [47]:
df

Unnamed: 0,id,subreddit,title,text,category,subcategory,impurity,clean_text
0,8f73s7,Harley,No Club Colors,Funny story. I went to college in Las Vegas. T...,autos,harley davidson,0.0,Funny story. I went to college in Las Vegas. T...
1,5s0q8r,Mustang,Roush vs Shleby GT500,"I am trying to determine which is faster, and ...",autos,ford,0.0,"I am trying to determine which is faster, and ..."
2,5z3405,Volkswagen,2001 Golf Wagon looking for some insight,Hello! <lb><lb>Trying to find some information...,autos,VW,0.0,Hello! Trying to find some information on repl...
3,7df18v,Lexus,IS 250 Coolant Flush/Change,https://www.cars.com/articles/how-often-should...,autos,lexus,0.0,https://www.cars.com/articles/how-often-should...
4,5tpve8,volt,Gen1 mpg w/ dead battery?,"Hi, new to this subreddit. I'm considering bu...",autos,chevrolet,0.0,"Hi, new to this subreddit. I'm considering buy..."
...,...,...,...,...,...,...,...,...
19995,7i2k6y,4Runner,Bilstein Shocks,I read a lot Forums and people recommend getti...,autos,toyota,0.0,I read a lot Forums and people recommend getti...
19996,83p2kv,Harley,Question on potential purchase of crashed bike.,I am thinking about buying a 2010 Harley Spor...,autos,harley davidson,0.0,I am thinking about buying a 2010 Harley Sport...
19997,7x722h,volt,Got our first warning light on our dash,My husband and I were headed somewhere and I w...,autos,chevrolet,0.0,My husband and I were headed somewhere and I w...
19998,7v2xmg,Lexus,Any IS models to avoid?,I am looking at getting a used Lexus IS (2014 ...,autos,lexus,0.0,I am looking at getting a used Lexus IS (2014 ...


### How spaCy tokenizes text

spaCy’s tokenizer is completely rule-based:

1. Split the text on whitespace.
2. Apply prefix, suffix, and infix splitting rules, defined by regular expressions, to further split the resulting chunks.
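
The two steps above can be sketched in plain Python. This is a toy illustration of the idea only, not spaCy's implementation: the rule sets `RE_PREFIX`, `RE_SUFFIX`, and `RE_INFIX` and the function `toy_tokenize` are assumptions made for this example, and spaCy's real rules are far larger and also include exception lists.

```python
import re

# Toy rules for illustration only; spaCy's real rule sets are far larger
RE_PREFIX = re.compile(r'^[#(\[]')
RE_SUFFIX = re.compile(r'[.,!?)\]]$')
RE_INFIX = re.compile(r'(-)')

def toy_tokenize(text):
    """Split on whitespace, then apply prefix, suffix and infix rules."""
    tokens = []
    for chunk in text.split():                    # step 1: split on whitespace
        prefixes, suffixes = [], []
        while chunk and RE_PREFIX.search(chunk):  # step 2a: peel off prefixes
            prefixes.append(chunk[0])
            chunk = chunk[1:]
        while chunk and RE_SUFFIX.search(chunk):  # step 2b: peel off suffixes
            suffixes.insert(0, chunk[-1])
            chunk = chunk[:-1]
        # step 2c: split the remainder at infixes, keeping the infix itself
        middle = [part for part in RE_INFIX.split(chunk) if part]
        tokens.extend(prefixes + middle + suffixes)
    return tokens

toy_tokens = toy_tokenize("A well-designed solution. #goodnight")
print(*toy_tokens, sep='|')
```

The result mirrors two effects visible in the spaCy output above: the hyphenated word is split at the infix, and the hashtag sign is separated from the word that follows it.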


In [48]:
# if the default tokenization is too fine-grained, one option is to merge tokens in a postprocessing step using doc.retokenize()

### Extracting Lemmas Based on Part of Speech

In [51]:
text = "My best friend Ryan Peters likes fancy adventure games."
doc = nlp(text)

print(*[t.lemma_ for t in doc], sep='|')

my|good|friend|Ryan|Peters|like|fancy|adventure|game|.
