# Text Preprocessing in NLP
Tokenize Text Columns Into Sentences

Required libraries
<br>
[pip install spacy](https://pypi.org/project/spacy/)
conda install jupyter
<br>
Run the following in Git Bash: "python -m spacy download en_core_web_sm"

In [1]:
# Import Dependencies and setup
import pandas as pd
import os

In [2]:
# read csv output from Instagrapy_split_text.ipynb
df=pd.read_csv("../../resources/ig_datascrape_jc_2021-08-25.csv", encoding="ISO 8859-1")
df.head(2)

,Unnamed: 0,Unnamed: 0.1,author,shortcode,timestamp,likes,comments,caption,text,Hash_tag2
0,0,0,shmee150,CSzoxcyrzj2,1629485286,19080,49,"Photo shared by Tim - Shmee on August 20, 2021...",Back at the wheel of an SF90! With @bannedauto...,"[['Ferrari'], ['SF90'], ['futureshmeemobile'],..."
1,1,1,shmee150,CSr2jQPjy59,1629224075,22143,100,"Photo shared by Tim - Shmee on August 17, 2021...",It's a P1 kinda day! Out for a drive in @super...,"[['McLaren'], ['P1'], ['McLarenP1'], ['testdri..."


## Data Cleaning Steps

### Punctuation Removal

In [3]:
# convert epoch time to datetime
df['timestamp'] = pd.to_datetime(df['timestamp'],unit='s')

# cast the text columns to the pandas "string" dtype
df.text = df.text.astype('string')
df.caption = df.caption.astype('string')
df.Hash_tag2 = df.Hash_tag2.astype('string')

df.dtypes  # verify string change

Unnamed: 0               int64
Unnamed: 0.1             int64
author                  object
shortcode               object
timestamp       datetime64[ns]
likes                    int64
comments                 int64
caption                 string
text                    string
Hash_tag2               string
dtype: object
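
As a quick sanity check on the epoch conversion, the first row's raw timestamp can be converted with the standard library alone. This is only a sketch; it assumes the scraped values are seconds since the Unix epoch in UTC, which is how `unit='s'` interprets them.

```python
from datetime import datetime, timezone

raw_timestamp = 1629485286  # first row of the "timestamp" column in df.head(2)

# Interpret the integer as seconds since 1970-01-01 00:00:00 UTC
converted = datetime.fromtimestamp(raw_timestamp, tz=timezone.utc)
print(converted.strftime("%Y-%m-%d %H:%M:%S"))  # 2021-08-20 18:48:06
```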

In [4]:
# library that contains punctuation
# import string
# string.punctuation

The following script removes "@" along with every other punctuation character. Do we need to modify it to keep "@"? If so, we will have to use a regex to fine-tune the punctuation removal.

In [5]:
# defining the function to remove punctuation
# def remove_punctuation(text):
#     punctuationfree="".join([i for i in text if i not in string.punctuation])
#     return punctuationfree

# storing the punctuation-free text in a new column
# df['clean_txt'] = df['text'].apply(lambda x: remove_punctuation(str(x)))
# df.clean_txt = df.clean_txt.astype('string')
# df.head()
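
If "@" (and perhaps "#") should survive the cleanup, one possible approach is a regex character class built from `string.punctuation` minus a keep-list. The sketch below is not part of the notebook's pipeline; the function name `remove_punctuation_keep` and its `keep` parameter are hypothetical.

```python
import re
import string

def remove_punctuation_keep(text, keep="@#"):
    """Remove ASCII punctuation from text, except the characters in `keep`."""
    # Every punctuation character that is not on the keep-list
    to_remove = "".join(ch for ch in string.punctuation if ch not in keep)
    if not to_remove:
        return text  # nothing to strip
    # re.escape makes characters such as ']' and '\' safe inside [...]
    pattern = re.compile("[" + re.escape(to_remove) + "]")
    return pattern.sub("", text)

sample = "Back at the wheel of an SF90! With @bannedauto, I'm excited. #Ferrari"
print(remove_punctuation_keep(sample))
# Back at the wheel of an SF90 With @bannedauto Im excited #Ferrari
```

Note that apostrophes are removed as well, so "I'm" becomes "Im"; adding "'" to `keep` would preserve contractions for a later expansion step.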

In [6]:
# count number of rows in DataFrame
number_of_rows = len(df)

number_of_rows

112

### Lowercase Text Manipulation

In [7]:
# storing all lower case text in a new column, "txt_lower". Note this leads to loss of
# information that a capital letter may convey, e.g. frustration or excitement.
# df['txt_lower']= df['clean_txt'].apply(lambda x: x.lower())

In [8]:
df.head()

,Unnamed: 0,Unnamed: 0.1,author,shortcode,timestamp,likes,comments,caption,text,Hash_tag2
0,0,0,shmee150,CSzoxcyrzj2,2021-08-20 18:48:06,19080,49,"Photo shared by Tim - Shmee on August 20, 2021...",Back at the wheel of an SF90! With @bannedauto...,"[['Ferrari'], ['SF90'], ['futureshmeemobile'],..."
1,1,1,shmee150,CSr2jQPjy59,2021-08-17 18:14:35,22143,100,"Photo shared by Tim - Shmee on August 17, 2021...",It's a P1 kinda day! Out for a drive in @super...,"[['McLaren'], ['P1'], ['McLarenP1'], ['testdri..."
2,2,2,shmee150,CSpWzdxJIIV,2021-08-16 18:58:41,21606,130,"Photo shared by Tim - Shmee on August 16, 2021...",The beautiful 300SL Roadster is without a shad...,"[['Mercedes'], ['300SL'], ['PebbleBeach'], ['C..."
3,3,3,shmee150,CSkdxyJAk2n,2021-08-14 21:23:26,30069,113,"Photo shared by Tim - Shmee on August 14, 2021...",The breathtaking @bugatti Bolide at @thequaile...,"[['Bugatti'], ['Bolide'], ['Quail'], ['CarWeek..."
4,4,4,shmee150,CSfHSEzi3BD,2021-08-12 19:30:39,34073,140,"Photo shared by Tim - Shmee on August 12, 2021...",The new @astonmartinlagonda Valkyrie Spider ha...,"[['AstonMartin'], ['Valkyrie'], ['ValkyrieSpid..."


In [9]:
# drop the redundant "Unnamed: 0.1" column
df = df.drop(['Unnamed: 0.1'], axis=1)
df = df.reset_index(drop=True)

# verify the steps above: assign ig_text to the first row's "text" column.
# punctuation removal and lowercasing are commented out, so the text is unchanged
ig_text = df.loc[0, "text"]
print(ig_text)

Back at the wheel of an SF90! With @bannedauto and @philwilson I'm checking out this stunning Assetto Fiorano car and thinking about the final spec I'll opt for mine, which actually needs to be locked next month. I'm also delighted to say that again the car has impressed me, I think it's one of the very best supercars currently on the market, mixing insane performance with new technology in such a seamless way. Needless to say, I'm quite excited about it! #Ferrari #SF90 #futureshmeemobile #AssettoFiorano #BannedAuto #LAcars #Shmee150


In [10]:
# verify "Unnamed: 0.1" was dropped
df.head(2)

,Unnamed: 0,author,shortcode,timestamp,likes,comments,caption,text,Hash_tag2
0,0,shmee150,CSzoxcyrzj2,2021-08-20 18:48:06,19080,49,"Photo shared by Tim - Shmee on August 20, 2021...",Back at the wheel of an SF90! With @bannedauto...,"[['Ferrari'], ['SF90'], ['futureshmeemobile'],..."
1,1,shmee150,CSr2jQPjy59,2021-08-17 18:14:35,22143,100,"Photo shared by Tim - Shmee on August 17, 2021...",It's a P1 kinda day! Out for a drive in @super...,"[['McLaren'], ['P1'], ['McLarenP1'], ['testdri..."


### Tokenization

Resources to better understand text preprocessing
<br>
[Tokenize Text Columns Into Sentences in Pandas](https://towardsdatascience.com/tokenize-text-columns-into-sentences-in-pandas-2c08bc1ca790)
<br>
Note that in spaCy v3, the v2 pattern of "nlp.create_pipe('sentencizer')" followed by "nlp.add_pipe(...)" is replaced by a single call, "nlp.add_pipe('sentencizer')".

In [11]:
# required library and spaCy model; un-comment and run if not already installed

# !pip install spacy
# !python -m spacy download en_core_web_sm

In [12]:
# Test. Tokenize using spaCy
import spacy

# nlp = spacy.load("en_core_web_sm")
# [sent.text for sent in nlp(ig_text).sents]

In [13]:
from spacy.lang.en import English

nlp = English()  # just the language with no model
sentencizer = nlp.add_pipe('sentencizer')

In [14]:
[sent.text for sent in nlp(ig_text).sents]

# END Test

['Back at the wheel of an SF90!',
 "With @bannedauto and @philwilson I'm checking out this stunning Assetto Fiorano car and thinking about the final spec I'll opt for mine, which actually needs to be locked next month.",
 "I'm also delighted to say that again the car has impressed me, I think it's one of the very best supercars currently on the market, mixing insane performance with new technology in such a seamless way.",
 "Needless to say, I'm quite excited about it! #",
 'Ferrari #SF90 #futureshmeemobile #AssettoFiorano #BannedAuto #LAcars #Shmee150']
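
The last two items show the rule-based sentencizer splitting the first "#" away from "Ferrari". One possible workaround, sketched here with the standard library only, is to peel the hashtags off a caption before it is split into sentences. The helper name `split_hashtags` is hypothetical, and the sketch assumes hashtags are not needed inside the sentences themselves (hashtags in the middle of a sentence would be removed too).

```python
import re

HASHTAG = re.compile(r"#\w+")

def split_hashtags(caption):
    """Return (body, hashtags): the caption without hashtags, and the tags found."""
    hashtags = HASHTAG.findall(caption)
    body = HASHTAG.sub("", caption)
    # Collapse the whitespace left behind by the removed tags
    body = re.sub(r"\s+", " ", body).strip()
    return body, hashtags

body, tags = split_hashtags(
    "Needless to say, I'm quite excited about it! #Ferrari #SF90 #Shmee150"
)
print(body)  # Needless to say, I'm quite excited about it!
print(tags)  # ['#Ferrari', '#SF90', '#Shmee150']
```

The `body` string could then be passed to `nlp(...)` in place of the full caption.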

In [15]:
# tokenize every row of the "text" column using a lambda function.
# this was a pain: some elements were ints or floats, so the column held mixed
# types (dtype object) and nlp raised "object of type 'float' has no len()".
# the workaround is to cast every value to a string before passing it to nlp

nlp = spacy.load("en_core_web_sm")
df["text"] = df["text"].apply(lambda x: [sent.text for sent in (nlp(str(x)).sents)])
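
One caveat with a blanket `str(x)`: if the non-string cells are missing values (an assumption; the notebook does not say what they were), they become literal strings such as "nan" and end up tokenized as one-word sentences. A minimal sketch of a more careful coercion, using a hypothetical helper `to_text`:

```python
import math

def to_text(value):
    """Coerce a cell to str, mapping missing values to an empty string."""
    if value is None:
        return ""
    if isinstance(value, float) and math.isnan(value):
        return ""
    return str(value)

cells = ["Back at the wheel of an SF90!", float("nan"), 150, None]
print([to_text(c) for c in cells])
# ['Back at the wheel of an SF90!', '', '150', '']
```

Inside pandas, `pd.isna(value)` is the more general check, since it also covers `pd.NA`, which is what the "string" dtype uses for missing values.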


In [16]:
# convert list of sentences to one sentence for each row

df = df.explode("text")
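# note: reset_index returns a new DataFrame and the result is not assigned here,
# so each exploded row keeps the index of the post it came from
df.reset_index(drop=True)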
df.head(15)


,Unnamed: 0,author,shortcode,timestamp,likes,comments,caption,text,Hash_tag2
0,0,shmee150,CSzoxcyrzj2,2021-08-20 18:48:06,19080,49,"Photo shared by Tim - Shmee on August 20, 2021...",Back at the wheel of an SF90!,"[['Ferrari'], ['SF90'], ['futureshmeemobile'],..."
0,0,shmee150,CSzoxcyrzj2,2021-08-20 18:48:06,19080,49,"Photo shared by Tim - Shmee on August 20, 2021...",With @bannedauto and @philwilson I'm checking ...,"[['Ferrari'], ['SF90'], ['futureshmeemobile'],..."
0,0,shmee150,CSzoxcyrzj2,2021-08-20 18:48:06,19080,49,"Photo shared by Tim - Shmee on August 20, 2021...",I'm also delighted to say that again the car h...,"[['Ferrari'], ['SF90'], ['futureshmeemobile'],..."
0,0,shmee150,CSzoxcyrzj2,2021-08-20 18:48:06,19080,49,"Photo shared by Tim - Shmee on August 20, 2021...","Needless to say, I'm quite excited about it!","[['Ferrari'], ['SF90'], ['futureshmeemobile'],..."
0,0,shmee150,CSzoxcyrzj2,2021-08-20 18:48:06,19080,49,"Photo shared by Tim - Shmee on August 20, 2021...",#Ferrari #SF90 #futureshmeemobile #AssettoFior...,"[['Ferrari'], ['SF90'], ['futureshmeemobile'],..."
1,1,shmee150,CSr2jQPjy59,2021-08-17 18:14:35,22143,100,"Photo shared by Tim - Shmee on August 17, 2021...",It's a P1 kinda day!,"[['McLaren'], ['P1'], ['McLarenP1'], ['testdri..."
1,1,shmee150,CSr2jQPjy59,2021-08-17 18:14:35,22143,100,"Photo shared by Tim - Shmee on August 17, 2021...",Out for a drive in @supercarsteven's newly pur...,"[['McLaren'], ['P1'], ['McLarenP1'], ['testdri..."
1,1,shmee150,CSr2jQPjy59,2021-08-17 18:14:35,22143,100,"Photo shared by Tim - Shmee on August 17, 2021...","Does make me wonder about P1 vs Senna, two ver...","[['McLaren'], ['P1'], ['McLarenP1'], ['testdri..."
1,1,shmee150,CSr2jQPjy59,2021-08-17 18:14:35,22143,100,"Photo shared by Tim - Shmee on August 17, 2021...",#McLaren #P1 #McLarenP1 #testdrive #CarWeek #S...,"[['McLaren'], ['P1'], ['McLarenP1'], ['testdri..."
2,2,shmee150,CSpWzdxJIIV,2021-08-16 18:58:41,21606,130,"Photo shared by Tim - Shmee on August 16, 2021...",The beautiful 300SL Roadster is without a shad...,"[['Mercedes'], ['300SL'], ['PebbleBeach'], ['C..."


In [17]:
df.rename(columns={"Unnamed: 0": "Dialogue ID"}, inplace=True)
df.index.name = "Sentence ID"

df.head(2)

Sentence ID,Dialogue ID,author,shortcode,timestamp,likes,comments,caption,text,Hash_tag2
0,0,shmee150,CSzoxcyrzj2,2021-08-20 18:48:06,19080,49,"Photo shared by Tim - Shmee on August 20, 2021...",Back at the wheel of an SF90!,"[['Ferrari'], ['SF90'], ['futureshmeemobile'],..."
0,0,shmee150,CSzoxcyrzj2,2021-08-20 18:48:06,19080,49,"Photo shared by Tim - Shmee on August 20, 2021...",With @bannedauto and @philwilson I'm checking ...,"[['Ferrari'], ['SF90'], ['futureshmeemobile'],..."


In [18]:
df.to_csv("../../resources/processed_ig_text_jc_2021-08-26.csv")

Still to do: remove ",", "-", "@", and "#", and expand contractions into full words, e.g. "isn't" into "is not".
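
A rough sketch of what that remaining cleanup might look like, using only the standard library. The `CONTRACTIONS` table is a tiny illustrative sample, not a complete mapping, and the function names are hypothetical. It assumes straight apostrophes, as in the caption printed earlier.

```python
import re

# Illustrative sample only; a real table would need many more entries,
# and some contractions are ambiguous ("it's" can be "it is" or "it has").
CONTRACTIONS = {
    "isn't": "is not",
    "i'm": "I am",
    "i'll": "I will",
    "it's": "it is",
}

_CONTRACTION_RE = re.compile(
    r"\b(" + "|".join(re.escape(key) for key in CONTRACTIONS) + r")\b",
    re.IGNORECASE,
)

def expand_contractions(text):
    """Replace known contractions; the original capitalization is not preserved."""
    return _CONTRACTION_RE.sub(lambda m: CONTRACTIONS[m.group(0).lower()], text)

def strip_symbols(text, symbols=",-@#"):
    """Delete every character listed in `symbols`."""
    return text.translate(str.maketrans("", "", symbols))

def clean(text):
    # Expand first, so the apostrophes are still present when matching
    return strip_symbols(expand_contractions(text))

print(clean("I'm quite excited, #Ferrari @bannedauto"))
# I am quite excited Ferrari bannedauto
```

Stripping "-" this way also removes hyphens inside words, which may or may not be wanted.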