# Text Preprocessing in Natural Language Processing
Tokenize Text Columns Into Sentences

Required libraries
<br>
[pip install spacy](https://pypi.org/project/spacy/)
<br>
Then download the small English model by running the following in Git Bash: "python -m spacy download en_core_web_sm"

In [1]:
# Import Dependencies and setup
import pandas as pd
import os

In [2]:
# read csv output from Instagrapy_split_text.ipynb
df=pd.read_csv("../../resources/ig_datascrape_jc_2021-08-25.csv", encoding="ISO 8859-1")
df.head(2)

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,author,shortcode,timestamp,likes,comments,caption,text,Hash_tag2
0,0,0,shmee150,CSzoxcyrzj2,1629485286,19080,49,"Photo shared by Tim - Shmee on August 20, 2021...",Back at the wheel of an SF90! With @bannedauto...,"[['Ferrari'], ['SF90'], ['futureshmeemobile'],..."
1,1,1,shmee150,CSr2jQPjy59,1629224075,22143,100,"Photo shared by Tim - Shmee on August 17, 2021...",It's a P1 kinda day! Out for a drive in @super...,"[['McLaren'], ['P1'], ['McLarenP1'], ['testdri..."


## Punctuation Removal

In [3]:
# library that contains punctuation
import string
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

The following script removes every character listed in string.punctuation, including "@". Do we need to modify the script to keep it? If so, we will have to use a regex to control exactly which punctuation is removed.

In [5]:
# defining the function to remove punctuation
def remove_punctuation(text):
    punctuationfree="".join([i for i in text if i not in string.punctuation])
    return punctuationfree
# store the punctuation-free text (note: the brackets wrap each result in a one-element list)
df['clean_txt'] = df['text'].apply(lambda x: [remove_punctuation(str(x))])
df.head(1)

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,author,shortcode,timestamp,likes,comments,caption,text,Hash_tag2,clean_txt
0,0,0,shmee150,CSzoxcyrzj2,1629485286,19080,49,"Photo shared by Tim - Shmee on August 20, 2021...",Back at the wheel of an SF90! With @bannedauto...,"[['Ferrari'], ['SF90'], ['futureshmeemobile'],...",[Back at the wheel of an SF90 With bannedauto ...
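
One way to keep "@" (and "#") while still removing the rest of the punctuation is a regex character class built from string.punctuation. This is a minimal sketch, not the notebook's method: the function name, the KEEP constant, and the sample sentence are all assumptions made for illustration.

```python
import re
import string

# Assumption: "@" and "#" are the only punctuation characters worth keeping.
KEEP = "@#"

# Build a character class from every punctuation character not listed in KEEP.
_to_remove = "".join(ch for ch in string.punctuation if ch not in KEEP)
PUNCT_PATTERN = re.compile("[" + re.escape(_to_remove) + "]")


def remove_punctuation_keep_tags(text):
    """Remove punctuation from text but keep the characters listed in KEEP."""
    return PUNCT_PATTERN.sub("", str(text))


# Made-up sample text, not taken from the dataset.
print(remove_punctuation_keep_tags("It's a P1 kinda day, with @someuser #McLaren!"))
# Its a P1 kinda day with @someuser #McLaren
```

Because the function returns a plain string, it could be applied with df['text'].apply(remove_punctuation_keep_tags) without wrapping the result in a list. Note that "_" is also part of string.punctuation, so handles containing underscores would still be altered unless "_" is added to KEEP.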


In [6]:
# count number of rows in DataFrame
number_of_rows = len(df)

number_of_rows

112

## Lowercase Text Manipulation
Problem: lower() only works on strings, but each value in "clean_txt" is a one-element list, because the lambda in the punctuation step wraps its result in brackets. Two options: call lower() on each element of the list with a for loop, or turn the list into a string and then call lower() on it.
<br>
Pseudocode example:
<br>
<pre>
original_list = ['A', 'B', 'C']
new_list = []
for item in original_list:
    new_list.append(item.lower())
print(new_list)
</pre>

In [13]:
df['msg_lower']= df['clean_txt'].apply(lambda x: x.lower())

AttributeError: 'list' object has no attribute 'lower'
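
The error above comes from calling lower() on a list. Both options described earlier can be sketched as small helper functions; the names lower_cell and lower_joined are hypothetical, and the sample list is illustrative.

```python
def lower_cell(value):
    """Option 1: lowercase each element and keep the list shape."""
    if isinstance(value, list):
        return [str(item).lower() for item in value]
    # Fall back to a plain string conversion for non-list values.
    return str(value).lower()


def lower_joined(value):
    """Option 2: join a list into one string, then lowercase it."""
    if isinstance(value, list):
        value = " ".join(str(item) for item in value)
    return str(value).lower()


sample = ["Back at the wheel of an SF90"]  # illustrative one-element list
print(lower_cell(sample))    # ['back at the wheel of an sf90']
print(lower_joined(sample))  # back at the wheel of an sf90
```

Either helper could replace the failing lambda, e.g. df['msg_lower'] = df['clean_txt'].apply(lower_cell). Option 1 keeps the list structure; option 2 produces plain strings, which is usually easier to work with downstream.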

In [None]:
# drop the redundant "Unnamed: 0.1" column
df = df.drop(columns=["Unnamed: 0.1"])
df = df.reset_index(drop=True)

# convert epoch time (seconds) to datetime
df['timestamp'] = pd.to_datetime(df['timestamp'], unit='s')

# store the first row's "text" value in ig_text
ig_text = df.loc[0, "text"]
print(ig_text)

In [None]:
df.head()

## Tokenization

Resources to better understand text preprocessing
<br>
[Tokenize Text Columns Into Sentences in Pandas](https://towardsdatascience.com/tokenize-text-columns-into-sentences-in-pandas-2c08bc1ca790)
<br>
Note that spaCy v3 replaces "nlp.add_pipe(nlp.create_pipe('sentencizer'))" with "nlp.add_pipe('sentencizer')".

In [None]:
# required library and a spacy model

# !pip install spacy
# !python -m spacy download en_core_web_sm

In [None]:
# Test. Tokenize using spaCy
import spacy

nlp = spacy.load("en_core_web_sm")
[sent.text for sent in nlp(ig_text).sents]

In [None]:
from spacy.lang.en import English

nlp = English()  # just the language with no model
sentencizer = nlp.add_pipe('sentencizer')

In [None]:
[sent.text for sent in nlp(ig_text).sents]

# END Test

In [None]:
# Tokenize every row of the "text" column into sentences, using a lambda function.
# This was a pain: some elements are ints or floats rather than strings, so the
# column has a mixed (object) dtype. Passing those values to nlp raised
# "nlp object of type 'float' has no len()".
# The workaround is to turn everything into a string before tokenizing.

nlp = spacy.load("en_core_web_sm")
df["text"] = df["text"].apply(lambda x: [sent.text for sent in (nlp(str(x)).sents)])
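
One caveat with the str(x) workaround: if the floats are missing values (NaN), str(x) turns them into the literal text "nan", which would then be tokenized as a sentence. Whether the floats in this dataset are NaN is an assumption. A minimal sketch of a stricter converter, with the hypothetical name to_text:

```python
import math


def to_text(value):
    """Return value as a string, mapping None and NaN to an empty string."""
    if value is None:
        return ""
    if isinstance(value, float) and math.isnan(value):
        return ""
    return str(value)


print(repr(str(float("nan"))))      # 'nan'  (what the plain str() workaround yields)
print(repr(to_text(float("nan"))))  # ''
print(repr(to_text(42)))            # '42'
print(repr(to_text("Hello.")))      # 'Hello.'
```

Swapping str(x) for to_text(x) in the lambda would give empty cells an empty sentence list instead of a spurious "nan" sentence.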


In [None]:
# explode each list of sentences into one row per sentence

df = df.explode("text")
# reset_index returns a new DataFrame, so assign the result back
df = df.reset_index(drop=True)
df
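
A note on the index: explode repeats the original row label for every sentence it produces, and reset_index returns a new DataFrame rather than modifying the existing one. The small example below uses made-up data (the demo frame and its values are assumptions) to show both behaviours.

```python
import pandas as pd

# Illustrative data: each "text" cell holds a list of sentences.
demo = pd.DataFrame({
    "author": ["a", "b"],
    "text": [["First sentence.", "Second sentence."], ["Only sentence."]],
})

exploded = demo.explode("text")
print(exploded.index.tolist())  # [0, 0, 1]  (labels repeat per source row)

# The result of reset_index must be assigned to take effect.
exploded = exploded.reset_index(drop=True)
print(exploded.index.tolist())  # [0, 1, 2]  (one unique label per sentence)
```

With a unique index per row, naming the index "Sentence ID" gives each sentence its own identifier.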


In [None]:
df.rename(columns={"Unnamed: 0": "Dialogue ID"}, inplace=True)
df.index.name = "Sentence ID"

df

In [None]:
df.to_csv("../../resources/processed_ig_text_jc_2021-08-26.csv")

## To do

Still need to remove ",", "-", "@", and "#", and to expand contractions into full words, e.g. "isn't" into "is not".
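
A possible starting point for the contraction step is a lookup table combined with a regex. This is only a sketch: the CONTRACTIONS mapping is a hypothetical, deliberately tiny list, and the function name is an assumption.

```python
import re

# Hypothetical, deliberately tiny mapping; a real list would be much longer.
CONTRACTIONS = {
    "isn't": "is not",
    "aren't": "are not",
    "can't": "cannot",
    "it's": "it is",
    "don't": "do not",
}

# Match any known contraction as a whole word, ignoring case.
_pattern = re.compile(
    r"\b(" + "|".join(re.escape(key) for key in CONTRACTIONS) + r")\b",
    flags=re.IGNORECASE,
)


def expand_contractions(text):
    """Replace each known contraction with its expanded form (lowercased)."""
    return _pattern.sub(lambda m: CONTRACTIONS[m.group(0).lower()], str(text))


print(expand_contractions("It's a P1 kinda day, isn't it?"))
# it is a P1 kinda day, is not it?
```

This step would have to run before punctuation removal, because the apostrophe is part of string.punctuation and "isn't" would otherwise already have become "isnt". Some contractions are ambiguous ("it's" can also mean "it has"), and the table only matches the straight apostrophe, so the mapping should be treated as an approximation.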