# Natural Language Processing | NLP 

### Basics 

`Read` Text Data

In [1]:
import pandas as pd
smsCorpus = pd.read_csv('../Data/SMSSpamCollection.tsv', 
                        header=None, 
                        delimiter='\t',
                        names=['Label','SMS'])
smsCorpus.head()

|   | Label | SMS |
|---|-------|-----|
| 0 | ham | I've been searching for the right words to tha... |
| 1 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 2 | ham | Nah I don't think he goes to usf, he lives aro... |
| 3 | ham | Even my brother is not like to speak with me. ... |
| 4 | ham | I HAVE A DATE ON SUNDAY WITH WILL!! |


`Explore` Dataset

In [2]:
print(f'Shape : {smsCorpus.shape}')

Shape : (5568, 2)


In [3]:
print(f'Unique Occurrence : \n{smsCorpus["Label"].value_counts()}')

Unique Occurrence : 
ham     4822
spam     746
Name: Label, dtype: int64


In [4]:
print(f'Any Missing Data ? \n{smsCorpus.isna().sum()}')

Any Missing Data ? 
Label    0
SMS      0
dtype: int64


### Regular Expression

`findall()` : Finds the Words themselves while ignoring the Characters that Separate them.

`split()` : Searches for the Characters that Split the Words while ignoring the Words themselves.

1. Text String for Describing a `Search Pattern`.

2. Identify `Whitespace` between Words | Tokens.

3. Identify | Create `Delimiters` or End of Line Escape Characters.

4. Remove `Punctuation` or `Numbers` from Text.

5. Clean HTML `Tags` from Text.

6. Identify Text `Patterns`.
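
Items 4 and 5 above can be sketched with `re.sub()`. This is a minimal illustration, not a complete cleaner: the `raw` string is a made-up example (not taken from the SMS dataset), and the tag pattern is a simplification that would not handle every HTML edge case.

```python
import re

# Hypothetical sample text, not taken from the SMS dataset
raw = '<p>Call <b>now</b> on 0800 123 456!!! Win $1000, today.</p>'

# Remove HTML tags: '<', then one or more characters that are not '>', then '>'
no_tags = re.sub(r'<[^>]+>', '', raw)

# Remove numbers and punctuation: keep only ASCII letters and whitespace
letters_only = re.sub(r'[^A-Za-z\s]', '', no_tags)

# Collapse the runs of whitespace left behind and trim both ends
clean = re.sub(r'\s+', ' ', letters_only).strip()

print(no_tags)  # Call now on 0800 123 456!!! Win $1000, today.
print(clean)    # Call now on Win today
```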

`Use Case`

1. Validate `Password` Criteria.

2. Search for a `URL` within a String.

3. Search `Files` on Computer.

4. Document Scraping.
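
The first two use cases can be sketched as follows. The password rule (at least 8 characters with a lowercase letter, an uppercase letter and a digit), the helper name `is_valid_password`, the URL pattern and the sample message are all assumptions made for illustration; real password policies and URL grammars are stricter.

```python
import re

def is_valid_password(password):
    """Hypothetical rule: 8+ characters with a lowercase letter, an uppercase letter and a digit."""
    return (len(password) >= 8
            and re.search(r'[a-z]', password) is not None
            and re.search(r'[A-Z]', password) is not None
            and re.search(r'[0-9]', password) is not None)

# Simplified URL pattern: 'http://' or 'https://' followed by non-whitespace characters
url_pattern = r'https?://\S+'
message = 'Claim your prize at http://example.com/win today'  # made-up message

strong = is_valid_password('Secret123')
weak = is_valid_password('secret')
urls = re.findall(url_pattern, message)

print(strong)  # True
print(weak)    # False
print(urls)    # ['http://example.com/win']
```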

In [5]:
import re

text = 'My Name is Kirankumar Yadav I am a Data Scientist'
messy_text = 'This     is a String made up      of Messy Strings'
messy_str = 'This-is-a-messy-string/madeup.String.to>>>test---2""""regex-methods'

`Split` a Sentence into a List of Words (`Tokens`)

`\s` : A Single Whitespace Character

In [6]:
re.split(r'\s', text)

['My',
 'Name',
 'is',
 'Kirankumar',
 'Yadav',
 'I',
 'am',
 'a',
 'Data',
 'Scientist']

`\S+` : Find all Runs of Non-Whitespace Characters

In [7]:
re.findall(r'\S+', text)

['My',
 'Name',
 'is',
 'Kirankumar',
 'Yadav',
 'I',
 'am',
 'a',
 'Data',
 'Scientist']

`\s+` : One or More Whitespace Characters

In [8]:
re.split(r'\s+', messy_text)

['This', 'is', 'a', 'String', 'made', 'up', 'of', 'Messy', 'Strings']

`\S+` : Find all Runs of Non-Whitespace Characters

In [9]:
re.findall(r'\S+', messy_text)

['This', 'is', 'a', 'String', 'made', 'up', 'of', 'Messy', 'Strings']
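
`split()` and `findall()` give the same tokens on the strings above, but they can differ at the edges. A small sketch, using two made-up strings (`gapped_text` and `padded_text` are not part of the original notebook), shows why `\s+` is preferred over `\s` and where the two functions diverge:

```python
import re

gapped_text = 'a  b'                  # hypothetical: two spaces between tokens
padded_text = '  This   is padded  '  # hypothetical: leading and trailing spaces

# A single '\s' splits on every space, so the gap yields an empty token
single_split = re.split(r'\s', gapped_text)

# '\s+' treats the run of spaces as one delimiter, but the padding at
# both ends still produces empty strings
plus_split = re.split(r'\s+', padded_text)

# findall() collects only the tokens, so nothing empty is returned
tokens = re.findall(r'\S+', padded_text)

print(single_split)  # ['a', '', 'b']
print(plus_split)    # ['', 'This', 'is', 'padded', '']
print(tokens)        # ['This', 'is', 'padded']
```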

`\W+` : Split on Runs of Non-Word Characters

In [10]:
re.split(r'\W+', messy_str)

['This',
 'is',
 'a',
 'messy',
 'string',
 'madeup',
 'String',
 'to',
 'test',
 '2',
 'regex',
 'methods']

`\w+` : Find all Runs of Word Characters

In [11]:
re.findall(r'\w+', messy_str)

['This',
 'is',
 'a',
 'messy',
 'string',
 'madeup',
 'String',
 'to',
 'test',
 '2',
 'regex',
 'methods']

Other Regular Expression `Functions`

1. re.`search()`

2. re.`match()`

3. re.`fullmatch()`

4. re.`finditer()`

5. re.`escape()`
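
A brief sketch of how the five functions listed above behave. The `sample` string mirrors the PEP8 sentence used in this notebook; the other inputs (`'PEP8 rules'`, `'a.b axb'`) are made up for illustration.

```python
import re

sample = 'I follow PEP8 Guidelines'
pattern = r'[A-Z]+[0-9]+'

# search(): first match anywhere in the string
found = re.search(pattern, sample)
print(found.group(), found.span())   # PEP8 (9, 13)

# match(): matches only at the start of the string
at_start = re.match(pattern, sample)              # None, the string starts with 'I '
at_start_ok = re.match(pattern, 'PEP8 rules')     # matches 'PEP8'

# fullmatch(): the whole string must match the pattern
whole = re.fullmatch(pattern, 'PEP8')             # match
not_whole = re.fullmatch(pattern, 'PEP8 rules')   # None

# finditer(): iterate over match objects, which also carry positions
capitals = [(m.group(), m.start()) for m in re.finditer(r'[A-Z]+', sample)]
print(capitals)                      # [('I', 0), ('PEP', 9), ('G', 14)]

# escape(): treat special characters such as '.' literally
unescaped = re.findall('a.b', 'a.b axb')            # '.' matches any character
escaped = re.findall(re.escape('a.b'), 'a.b axb')   # only the literal 'a.b'
print(unescaped, escaped)            # ['a.b', 'axb'] ['a.b']
```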

In [12]:
pep8_test = 'I follow PEP8 Guidelines'
pep7_test = 'I follow PEP7 Guidelines'
peep8_test = 'I follow PEEP8 Guidelines'

In [13]:
re.findall('[a-z]+', pep8_test)

['follow', 'uidelines']

In [14]:
re.findall('[a-zA-Z]+', pep8_test)

['I', 'follow', 'PEP', 'Guidelines']

In [15]:
re.findall('[a-zA-Z0-9]+', pep8_test)

['I', 'follow', 'PEP8', 'Guidelines']

In [16]:
re.findall('[A-Z]+[0-9]+', pep8_test)

['PEP8']

re.`sub()` : Substitute

In [17]:
re.sub('[A-Z]+[0-9]+', 'PEP8 Python Style', pep7_test)

'I follow PEP8 Python Style Guidelines'
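
The same pattern also matches the `PEEP8` variant defined earlier, because `[A-Z]+` accepts any run of capital letters. The sketch below redefines the test strings so it runs on its own; the capture-group replacement (`\1-\2`) is an extra illustration that is not part of the original notebook.

```python
import re

pep8_test = 'I follow PEP8 Guidelines'
peep8_test = 'I follow PEEP8 Guidelines'

# 'PEEP8' is a run of capitals followed by a digit, so it is replaced too
fixed = re.sub(r'[A-Z]+[0-9]+', 'PEP8 Python Style', peep8_test)

# Capture groups let the replacement reuse parts of the match:
# \1 is the letters, \2 is the digits
hyphenated = re.sub(r'([A-Z]+)([0-9]+)', r'\1-\2', pep8_test)

print(fixed)       # I follow PEP8 Python Style Guidelines
print(hyphenated)  # I follow PEP-8 Guidelines
```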