[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/infinite-Joy/natural_language_processing_for_professionals/blob/master/notebooks/chapter_3_string_processing_in_python.ipynb)

## The `str` type

In [None]:
text = "quick brown fox jumps over the lazy dog."
print(type(text))

<class 'str'>


## Working with Unicode

In [None]:
text = "El Niño".encode("utf-8")
print('original text: ', text)
print('text after decoding: ', text.decode("utf-8"))

original text:  b'El Ni\xc3\xb1o'
text after decoding:  El Niño
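
The difference between a `str` and its encoded `bytes` shows up when comparing lengths. The following is a minimal sketch; the variable names `text` and `encoded` are illustrative, and `latin-1` is used only as a second encoding for contrast.

```python
text = "El Ni\u00f1o"          # "\u00f1" is the single code point for 'ñ'
encoded = text.encode("utf-8")

# len() on a str counts Unicode code points
print(len(text))               # 7

# len() on bytes counts bytes; 'ñ' needs two bytes in UTF-8
print(len(encoded))            # 8

# a different encoding produces different bytes for the same text
print(text.encode("latin-1"))  # b'El Ni\xf1o'
```

Decoding must use the same encoding that produced the bytes; otherwise the result is either an error or garbled text.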


### Size of different strings

In [None]:
import sys
string = 'hello'
print(sys.getsizeof(string))

# ASCII text: each additional character costs 1 byte
print(sys.getsizeof(string+'!')-sys.getsizeof(string))

# Bengali letter (code point above U+00FF): 2 bytes per character
string2 = 'অ'
print(sys.getsizeof(string2+'অ')-sys.getsizeof(string2))
print(sys.getsizeof(string2))

# emoji (code point above U+FFFF): 4 bytes per character
string3 = '🐍'
print(sys.getsizeof(string3+'💻')-sys.getsizeof(string3))
print(sys.getsizeof(string3))

54
1
2
76
4
80
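
The per-character width applies to the whole string, not to individual characters. The sketch below assumes CPython 3.3 or later, where the widest code point in a string decides how many bytes every character uses; exact sizes from `sys.getsizeof` are implementation details and may differ elsewhere. The names `ascii_text`, `mixed_text`, `ascii_cost`, and `mixed_cost` are illustrative.

```python
import sys

ascii_text = 'hello'
mixed_text = 'hello' + '\U0001F40D'   # the same text plus one snake emoji

# bytes added by appending one more ASCII character to each string
ascii_cost = sys.getsizeof(ascii_text + '!') - sys.getsizeof(ascii_text)
mixed_cost = sys.getsizeof(mixed_text + '!') - sys.getsizeof(mixed_text)

print(ascii_cost)
print(mixed_cost)
```

On CPython the second number is expected to be larger than the first: once a string contains a single 4-byte character, appending even a plain ASCII character costs 4 bytes.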


## Common Python string methods

### Character lengths and word lengths

In [None]:
string = 'natural language processing for professionals'
print('character length of the sentence:', len(string))

words = string.split()
print('word length of the sentence:', len(words))

character length of the sentence: 45
word length of the sentence: 5


### Character frequency

In [None]:
from collections import Counter

print(Counter('natural language processing for professionals'))

Counter({'a': 5, 's': 5, 'n': 4, 'r': 4, ' ': 4, 'o': 4, 'l': 3, 'g': 3, 'e': 3, 'u': 2, 'p': 2, 'i': 2, 'f': 2, 't': 1, 'c': 1})
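
`Counter` can also rank items and count words instead of characters. A short sketch, with `sentence`, `char_counts`, and `word_counts` as illustrative names:

```python
from collections import Counter

sentence = 'natural language processing for professionals'

# most_common(n) returns the n most frequent items as (item, count) pairs;
# spaces are removed first so that only letters are ranked
char_counts = Counter(sentence.replace(' ', ''))
print(char_counts.most_common(2))   # [('a', 5), ('s', 5)]

# passing a list of words counts word frequencies instead
word_counts = Counter(sentence.split())
print(word_counts['natural'])       # 1
print(word_counts['missing'])       # 0, absent keys count as zero
```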


### Pattern Search

In [None]:
spam_string = "click on http://spam.com"

print('http' in spam_string)

True


In [None]:
import re

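def url_present(text):
    # raw string, so that \. reaches the regex engine as an escaped dot
    pattern = r'http.*\.com'
    # search the argument itself rather than the global spam_string
    return bool(re.search(pattern, text))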

spam_string = "click on http://spam.com"
print(url_present(spam_string))

spam_string = "update on ticket"
print(url_present(spam_string))

True
False
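
The pattern above only matches addresses that contain `.com`. One possible generalization is sketched below; `URL_PATTERN` and `find_urls` are hypothetical names, and the pattern is a deliberate simplification rather than a full URL grammar.

```python
import re

# Simplified assumption: a URL is 'http' or 'https', then '://',
# then any run of non-whitespace characters.
URL_PATTERN = re.compile(r'https?://\S+')

def find_urls(text):
    """Return every substring of `text` that looks like a URL."""
    return URL_PATTERN.findall(text)

print(find_urls("click on http://spam.com or https://example.org/offer now"))
# ['http://spam.com', 'https://example.org/offer']
print(find_urls("update on ticket"))
# []
```

Because `\S+` stops at whitespace, trailing punctuation such as a full stop directly after the address would be captured as part of the match.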


### Strip whitespace

In [None]:
string_with_whitespace = "    I am without whitespace. \n"
string_without_whitespace = "I am without whitespace."
print('Equality of the two strings: ',
  string_without_whitespace==string_with_whitespace)
print('Equality of the strings after performing strip on string_with_whitespace: ',
  string_with_whitespace.strip()==string_without_whitespace)

Equality of the two strings:  False
Equality of the strings after performing strip on string_with_whitespace:  True


### Splitting Strings

In [None]:
document = "learning natural language processing"
print(document.split())

['learning', 'natural', 'language', 'processing']
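
`split()` with no argument behaves differently from `split(' ')`, which matters for text with repeated spaces. A small sketch with made-up input strings:

```python
line = "learning  natural language"   # note the double space

# no argument: split on any run of whitespace, never yields empty strings
print(line.split())                   # ['learning', 'natural', 'language']

# explicit separator: every single space is a boundary
print(line.split(' '))                # ['learning', '', 'natural', 'language']

# maxsplit limits the number of splits performed
print("a,b,c".split(',', maxsplit=1)) # ['a', 'b,c']
```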


### Joining list elements into a single string

In [None]:
tokens = ['learning', 'natural', 'language', 'processing']
print(" ".join(tokens))

learning natural language processing


### Case of the string

In [None]:
string = "EDUCATIVE"
print('unique object identification of `string`', id(string))

lower_case = string.lower()
print('sentence in lower case: ', lower_case)
print('unique object identification of `lower_case`', id(lower_case))

upper_case = lower_case.upper()
print('sentence in upper case: ', upper_case)
print('unique object identification of `upper_case`', id(upper_case))

print('is a match with the original text: ', upper_case == string)

unique object identification of `string` 140289493087984
sentence in lower case:  educative
unique object identification of `lower_case` 140289493090224
sentence in upper case:  EDUCATIVE
unique object identification of `upper_case` 140289493086704
is a match with the original text:  True
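
For case-insensitive comparison of non-English text, `lower()` is not always enough. The sketch below uses the standard `str.casefold()` method; the German example words are chosen only for illustration.

```python
a = "Straße"
b = "STRASSE"

# lower() keeps 'ß' as it is, so the two strings still differ
print(a.lower() == b.lower())         # False

# casefold() applies more aggressive folding ('ß' becomes 'ss')
print(a.casefold() == b.casefold())   # True
```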


## Pandas

In [None]:
import pandas as pd
s1 = pd.Series(
    ['string processing using vanilla python',
     'string processing in pandas']
)
print(s1)
print()
print('converting all the strings to uppercase')
print(s1.str.upper())
print()
print('converting all the strings to lowercase')
print(s1.str.lower())
print()
print('split all the sentences to words')
print(s1.str.strip().str.split())

0    string processing using vanilla python
1               string processing in pandas
dtype: object

converting all the strings to uppercase
0    STRING PROCESSING USING VANILLA PYTHON
1               STRING PROCESSING IN PANDAS
dtype: object

converting all the strings to lowercase
0    string processing using vanilla python
1               string processing in pandas
dtype: object

split all the sentences to words
0    [string, processing, using, vanilla, python]
1                [string, processing, in, pandas]
dtype: object
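
The `.str` accessor exposes more than case conversion and splitting. A brief sketch of two other vectorized operations, assuming pandas is installed; the Series name `s` is illustrative.

```python
import pandas as pd

s = pd.Series(['string processing using vanilla python',
               'string processing in pandas'])

# str.contains tests each element for a substring (or regex) match
print(s.str.contains('pandas').tolist())   # [False, True]

# after splitting, str.len gives the number of words per element
print(s.str.split().str.len().tolist())    # [5, 4]
```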


## Amazon reviews dataset

Download the Amazon reviews dataset from GitHub.

Original source: https://cseweb.ucsd.edu/~jmcauley/datasets/amazon_v2/

In [None]:
!wget https://github.com/infinite-Joy/natural_language_processing_for_professionals/raw/main/data/Video_Games.json.gz

--2023-05-07 16:39:23--  https://github.com/infinite-Joy/natural_language_processing_for_professionals/raw/main/data/Video_Games.json.gz
Resolving github.com (github.com)... 140.82.113.3
Connecting to github.com (github.com)|140.82.113.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://media.githubusercontent.com/media/infinite-Joy/natural_language_processing_for_professionals/main/data/Video_Games.json.gz [following]
--2023-05-07 16:39:23--  https://media.githubusercontent.com/media/infinite-Joy/natural_language_processing_for_professionals/main/data/Video_Games.json.gz
Resolving media.githubusercontent.com (media.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to media.githubusercontent.com (media.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 522823613 (499M) [application/octet-stream]
Saving to: ‘Video_Games.json.gz’


2023-05-07 16:39:

In [None]:
!ls -ltr

total 510576
drwxr-xr-x 1 root root      4096 May  3 13:31 sample_data
-rw-r--r-- 1 root root 522823613 May  7 16:39 Video_Games.json.gz


In [None]:
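import gzip
import json
import pandas as pd

def parse(path):
    """Yield one parsed JSON record per line of a gzipped JSON-lines file."""
    # the context manager closes the file once iteration finishes
    with gzip.open(path, 'rb') as g:
        for line in g:
            yield json.loads(line)

def getDF(path):
    """Load every record from the file into a DataFrame."""
    # collect the records keyed by row number, then build the frame in one step
    records = {i: d for i, d in enumerate(parse(path))}
    return pd.DataFrame.from_dict(records, orient='index')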


df = getDF('./Video_Games.json.gz')
df = df[['reviewText', 'overall']]
print(df.shape)

In [None]:
import string

# map every punctuation character to a space
translator = str.maketrans(string.punctuation, ' '*len(string.punctuation))

def text_preprocessing(df):
    """
    Normalize the review text: trim surrounding whitespace, lowercase,
    and replace punctuation with spaces.
    """
    # trim the whitespace at the edges of the string
    df['reviewText'] = df['reviewText'].str.strip()

    # lowercase the text in the string
    df['reviewText'] = df['reviewText'].str.lower()

    # replace the punctuation in the string with spaces
    df['reviewText'] = df['reviewText'].apply(lambda text: text.translate(translator))

    return df


df = df.dropna()
df = df.drop_duplicates()
df = text_preprocessing(df)
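
The effect of these three steps can be checked on a tiny frame without downloading the dataset. This is a sketch under stated assumptions: the two reviews in `toy` are made up for illustration and are not taken from the Amazon data.

```python
import string
import pandas as pd

# same mapping as above: every punctuation character becomes a space
translator = str.maketrans(string.punctuation, ' ' * len(string.punctuation))

# hypothetical reviews, invented for this example
toy = pd.DataFrame({'reviewText': ['  Great game!!  ', "Didn't work, sadly."],
                    'overall': [5.0, 1.0]})

cleaned = (toy['reviewText']
           .str.strip()                                      # trim the edges
           .str.lower()                                      # lowercase
           .apply(lambda text: text.translate(translator)))  # punctuation to spaces

print(cleaned.tolist())
# ['great game  ', 'didn t work  sadly ']
```

Replacing punctuation with spaces leaves extra whitespace behind and splits contractions such as "didn't" into two tokens, which may or may not be desirable depending on the later tokenization step.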

In [None]:
df.head()

| | reviewText | overall |
|---|---|---|
| 0 | i used to play this game years ago and loved i... | 1.0 |
| 1 | the game itself worked great but the story lin... | 3.0 |
| 2 | i had to learn the hard way after ordering thi... | 4.0 |
| 3 | the product description should state this clea... | 1.0 |
| 4 | i would recommend this learning game for anyon... | 4.0 |
