# NLP Tasks with the Training Dataset

The following code analyzes the training dataset with a few basic NLP tasks.

## Import

Imports needed for the following code.

In [6]:
import nltk
import pandas

## Data Preprocessing

The following code preprocesses the training dataset.
First verify that the data is in the expected format, then join the values of the 'text' column into a single string.

In [7]:
# read ../data/cleaned/out.csv into a dataframe
df = pandas.read_csv('../data/cleaned/out.csv')
# get the number of rows and columns of the dataframe
df.shape

(65989, 2)

In [8]:
# get the type of each column
df.dtypes

label     int64
text     object
dtype: object

In [9]:
# list the distinct values in the 'label' column to spot any unexpected labels
df['label'].unique()

array([0, 1, 6, 4, 5, 2, 3], dtype=int64)

In [10]:
# join the 'text' column values into a single string, separated by '.'
text = '.'.join(df['text'].to_list())
type(text)

str
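
A note on the separator: joining rows with '.' and then splitting on whitespace can glue the last word of one row to the first word of the next, which is a likely source of tokens such as 'today.i'. The sketch below illustrates the effect on two made-up rows (the `rows` list is hypothetical and only stands in for `df['text'].to_list()`); it is not a change to the notebook's results.

```python
# Hypothetical sample rows standing in for df['text'].to_list()
rows = ["i was busy today", "i feel fine"]

# Joining with '.' fuses the last word of a row with the first word of the next
dot_joined = ".".join(rows)
print(dot_joined.split())    # ['i', 'was', 'busy', 'today.i', 'feel', 'fine']

# Joining with a space keeps row boundaries as token boundaries
space_joined = " ".join(rows)
print(space_joined.split())  # ['i', 'was', 'busy', 'today', 'i', 'feel', 'fine']
```

Switching the separator would change the vocabulary size and counts reported in the later cells, so those outputs would need to be regenerated.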

## Frequency Distribution of Words

The following code is used to get the frequency distribution of words in the training dataset.

In [11]:
# Frequency distribution of words in the text
fdist = nltk.FreqDist(text.split())
len(fdist)

118141

In [12]:
vocab = fdist.keys()
list(vocab)[:10]

['@tiffanylue',
 'i',
 'know',
 'was',
 'listenin',
 'to',
 'bad',
 'habit',
 'earlier',
 'and']
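
The vocabulary above still contains user mentions, and whitespace splitting keeps punctuation attached to words. A minimal sketch of a stricter tokenizer follows, assuming that mentions and punctuation are not wanted in the counts. The `tokenize` helper and the `sample` string are hypothetical; `collections.Counter` is used here because `nltk.FreqDist` is a subclass of it, so the resulting token list could be passed to `nltk.FreqDist` in the same way.

```python
import re
from collections import Counter

# Hypothetical sample text; the notebook would use the `text` string instead
sample = "@someuser i know i was listenin to bad habit earlier today. today was fine"

def tokenize(s):
    """Lowercase, drop @mentions, and keep runs of letters and apostrophes."""
    s = re.sub(r"@\w+", " ", s.lower())
    return re.findall(r"[a-z']+", s)

tokens = tokenize(sample)
counts = Counter(tokens)
print(counts["today"])  # 2 ('today.' and 'today' are counted together)
```

With plain `split()`, 'today.' and 'today' would be counted as two different words.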

### Frequency Distribution of Words with Length > 5 and Frequency > 100

In [13]:
freqwords = [(w,fdist[w]) for w in vocab if len(w) > 5 and fdist[w] > 100]
display(len(freqwords))
# order by frequency in descending order
freqwords.sort(key=lambda x: x[1], reverse=True)
freqwords[:10]

194

[('feeling', 5343),
 ('really', 2205),
 ('because', 1576),
 ('little', 1249),
 ('people', 1047),
 ('should', 950),
 ('myself', 853),
 ('something', 813),
 ('always', 709),
 ('getting', 692)]
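
The filter and sort can also be written in one step with `most_common()`, which `nltk.FreqDist` inherits from `collections.Counter` and which returns pairs ordered from most to least frequent. The following is a sketch only: the `frequent_long_words` helper and the sample counts are hypothetical, and the thresholds mirror the ones used above.

```python
from collections import Counter

# Hypothetical counts standing in for fdist
counts = Counter({"feeling": 5343, "really": 2205, "the": 9000,
                  "little": 1249, "rarely": 40})

def frequent_long_words(counts, min_len=5, min_freq=100):
    """Return (word, count) pairs with len(word) > min_len and
    count > min_freq, most frequent first."""
    return [(w, c) for w, c in counts.most_common()
            if len(w) > min_len and c > min_freq]

print(frequent_long_words(counts))
# [('feeling', 5343), ('really', 2205), ('little', 1249)]
```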

### Check Whether the Most Frequent Words Are Similar

In [16]:
# for each frequent word, find the first lower-ranked frequent word that contains it and save the pair to a list
words = []
for i in range(len(freqwords)):
    for j in range(i+1, len(freqwords)):
        if freqwords[i][0] in freqwords[j][0]:
            words.append((freqwords[i][0], freqwords[j][0]))
            break

In [17]:
# number of pairs found
len(words)

4

In [18]:
# show the pairs
words

[('feeling', 'feelings'),
 ('myself', 'myself.i'),
 ('follow', 'following'),
 ('today.', 'today.i')]
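
The loop above only compares a word with words that come later in the frequency ranking and stops at the first match, so it can miss pairs. A sketch of an exhaustive variant follows, assuming every containment pair is of interest. The `substring_pairs` helper and the sample list are hypothetical; in the notebook the input would be `[w for w, _ in freqwords]`.

```python
def substring_pairs(words):
    """Return every (shorter, longer) pair where `shorter` is a
    proper substring of `longer`, regardless of list order."""
    pairs = []
    for shorter in words:
        for longer in words:
            if shorter != longer and shorter in longer:
                pairs.append((shorter, longer))
    return pairs

# Hypothetical word list; 'feelings' is ranked before 'feeling' on purpose
sample = ["feelings", "feeling", "follow", "following", "followers"]
print(substring_pairs(sample))
# [('feeling', 'feelings'), ('follow', 'following'), ('follow', 'followers')]
```

This compares every pair of words, which is quadratic in the list length but cheap for a list of a few hundred frequent words.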