# Text Preprocessing in NLP
Tokenize Text Columns Into Sentences

Required libraries
<br>
[pip install spacy](https://pypi.org/project/spacy/)
[conda install jupyter](
<br>
Then run the following in Git Bash to download the English model: "python -m spacy download en_core_web_sm"

In [None]:
# Import Dependencies and setup
import pandas as pd
import os

In [None]:
# read csv output from Instagrapy_split_text.ipynb
df = pd.read_csv("../../resources/ig_datascrape_jc_2021-08-25.csv", encoding="ISO 8859-1")
df.head(2)

## Data Cleaning Steps

### Punctuation Removal

In [None]:
# convert epoch time to datetime
df['timestamp'] = pd.to_datetime(df['timestamp'],unit='s')

# cast the specified columns to the pandas string dtype
df.text = df.text.astype('string')
df.caption = df.caption.astype('string')
df.Hash_tag2 = df.Hash_tag2.astype('string')

df.dtypes  # verify string change

In [None]:
# library that contains punctuation
import string
string.punctuation

The following script removes "@". Do we need to modify the script to keep it? If so, we will have to use Regex to more finely tune the punctuation removal.

In [None]:
# defining the function to remove punctuation
def remove_punctuation(text):
    punctuationfree="".join([i for i in text if i not in string.punctuation])
    return punctuationfree

# storing the punctuation-free text in a new column
# (return the string itself; wrapping it in a list would leave "['...']" after the string cast)
df['clean_txt'] = df['text'].apply(lambda x: remove_punctuation(str(x)))
df.clean_txt = df.clean_txt.astype('string')
df.head()
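
If "@" (and "#") should be kept, one option is a regular expression built from `string.punctuation` minus the characters to preserve. The following is a minimal sketch, not the notebook's method; `KEEP` and `remove_punctuation_keep` are hypothetical names introduced here for illustration.

```python
import re
import string

# Characters to preserve (an assumption for this sketch: mentions and hashtags).
KEEP = "@#"

# Every punctuation mark except the ones listed in KEEP.
_to_remove = "".join(ch for ch in string.punctuation if ch not in KEEP)

# re.escape guards characters such as "]", "\" and "-" inside the character class.
PUNCT_PATTERN = re.compile("[" + re.escape(_to_remove) + "]")


def remove_punctuation_keep(text):
    """Strip punctuation from text, leaving the characters in KEEP untouched."""
    return PUNCT_PATTERN.sub("", str(text))


sample = "Hey @friend, check this out!!! #nlp-rocks"
print(remove_punctuation_keep(sample))
# Hey @friend check this out #nlprocks
```

It could be applied to the DataFrame in the same way as `remove_punctuation`, for example with `df['text'].apply(remove_punctuation_keep)`.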

In [None]:
# count number of rows in DataFrame
number_of_rows = len(df)

number_of_rows

### Lowercase Text Manipulation

In [None]:
# storing all lower case text in a new column, "txt_lower". Note this leads to loss of
# information that a capital letter may convey, e.g. frustration or excitement.
df['txt_lower']= df['clean_txt'].apply(lambda x: x.lower())

In [None]:
df.head()

In [None]:
# drop the "Unnamed: 0.1" column
df = df.drop(['Unnamed: 0.1'], axis=1)
df = df.reset_index(drop=True)

# verify the above scripts work: assign the first row's "txt_lower" value to ig_text.
# all punctuation is now removed, and words are in lower case
ig_text = df.loc[0, "txt_lower"]
print(ig_text)

In [None]:
# verify "Unnamed: 0.1" was dropped
df.head(2)

### Tokenization

Resources to better understand text preprocessing
<br>
[Tokenize Text Columns Into Sentences in Pandas](https://towardsdatascience.com/tokenize-text-columns-into-sentences-in-pandas-2c08bc1ca790)
<br>
Note that in spaCy v3, "nlp.add_pipe" takes the component's string name, so the v2 pattern "nlp.add_pipe(nlp.create_pipe('sentencizer'))" becomes "nlp.add_pipe('sentencizer')".

In [None]:
# required library and spaCy model; un-comment and run if not already installed

# !pip install spacy
# !python -m spacy download en_core_web_sm

In [None]:
# Test. Tokenize using spaCy
import spacy

# nlp = spacy.load("en_core_web_sm")
# [sent.text for sent in nlp(ig_text).sents]

In [None]:
from spacy.lang.en import English

nlp = English()  # just the language with no model
sentencizer = nlp.add_pipe('sentencizer')

In [None]:
[sent.text for sent in nlp(ig_text).sents]

# END Test

In [None]:
# tokenize all data in column "txt_lower" using a lambda function.
# this was a pain: some elements were ints or floats, giving the column a mixed
# object dtype, which made the call fail with "object of type 'float' has no len()".
# the workaround is to turn everything into a string

nlp = spacy.load("en_core_web_sm")
df["txt_lower"] = df["txt_lower"].apply(lambda x: [sent.text for sent in (nlp(str(x)).sents)])
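
One caveat of the `str(x)` workaround is that a missing value stored as a float NaN becomes the literal text "nan", which would then be tokenized as a word. A small helper can map missing values to an empty string instead. This is only a sketch: `to_text` is a hypothetical name, and it covers `None` and float NaN only (pandas' own `pd.NA` would need a check with `pd.isna`).

```python
import math


def to_text(value):
    """Return value as a string, mapping None and float NaN to an empty string."""
    if value is None:
        return ""
    if isinstance(value, float) and math.isnan(value):
        return ""
    return str(value)


# Mixed values of the kind that can end up in an object-dtype column.
mixed = ["great photo", 42, 3.5, float("nan"), None]
print([to_text(v) for v in mixed])
# ['great photo', '42', '3.5', '', '']
```

In the tokenization line, `nlp(to_text(x))` could then stand in for `nlp(str(x))`.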


In [None]:
# convert each row's list of sentences into one row per sentence.
# explode repeats the original index, so reset it and keep the result

df = df.explode("txt_lower")
df = df.reset_index(drop=True)
df.head(15)
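
Because `explode` repeats the original index for every sentence that came from the same row, the index is only usable as a unique sentence identifier once it has been reset and the result assigned back. A toy illustration, with a made-up DataFrame and column names that are not from the dataset:

```python
import pandas as pd

# Hypothetical two-post example; each cell holds a list of sentences.
toy = pd.DataFrame({
    "post_id": [10, 11],
    "sentences": [["first sentence", "second sentence"], ["only sentence"]],
})

exploded = toy.explode("sentences")
print(exploded.index.tolist())
# [0, 0, 1]  (index values repeat per source row)

# reset_index returns a new DataFrame, so the result must be assigned.
exploded = exploded.reset_index(drop=True)
print(exploded.index.tolist())
# [0, 1, 2]
```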


In [None]:
df.rename(columns={"Unnamed: 0": "Dialogue ID"}, inplace=True)
df.index.name = "Sentence ID"

df.head(2)

In [None]:
df.to_csv("../../resources/processed_ig_text_jc_2021-08-26.csv")

Still to do: remove ",", "-", "@", and "#", and expand contractions into full words, e.g. "isn't" to "is not".
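
Expanding contractions such as "isn't" could be sketched with a small lookup table and a regular expression. The mapping below is a deliberately tiny, illustrative assumption (some entries, like "it's", are ambiguous in real text), and `CONTRACTIONS` and `expand_contractions` are hypothetical names.

```python
import re

# Illustrative mapping only; a real project would need a much fuller list.
CONTRACTIONS = {
    "isn't": "is not",
    "can't": "cannot",
    "won't": "will not",
    "it's": "it is",
    "i'm": "i am",
}

# Match any listed contraction as a whole word, ignoring case.
_CONTRACTION_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(key) for key in CONTRACTIONS) + r")\b",
    flags=re.IGNORECASE,
)


def expand_contractions(text):
    """Replace each known contraction with its lowercase expansion."""
    return _CONTRACTION_PATTERN.sub(
        lambda match: CONTRACTIONS[match.group(0).lower()], text
    )


print(expand_contractions("this isn't hard and it's fun"))
# this is not hard and it is fun
```

This step would have to run before punctuation removal, since stripping the apostrophe turns "isn't" into "isnt" and the lookup no longer matches.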