In this class we will cover the following:

1) Basic text data exploration
2) Basic text cleaning processes such as converting words to lower case and removing stop words, special characters, punctuation, etc.
3) Creating a DTM (document-term matrix) and related concepts such as sparsity
4) Bag of words to find the key words in the tweets

In [1]:
# Let's import the libraries

import pandas as pd
import string
import nltk
from sklearn.feature_extraction.text import CountVectorizer
from matplotlib import pyplot as plt

# Let's import our data

df = pd.read_csv("/Users/amitchoudhary/Downloads/tweet_data.csv")

df.head()



Unnamed: 0,id,label,tweet
0,1,0,@user when a father is dysfunctional and is s...
1,2,0,@user @user thanks for #lyft credit i can't us...
2,3,0,bihday your majesty
3,4,0,#model i love u take with u all the time in ...
4,5,0,factsguide: society now #motivation


Let's do some basic data exploration:

1) Number of words in each tweet
2) Number of characters in each tweet
3) Average word length in each tweet
4) Number of stop words in each tweet

In [2]:
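# Number of words in each tweet

# Let's create a lambda function to find the word count.
# Note: split(" ") also counts the empty strings produced by leading,
# trailing, or repeated spaces, so the count can exceed the number of visible words.

fun_words = lambda x : len(str(x).split(" "))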

df['word_count'] = df['tweet'].apply(fun_words)

df.head()

Unnamed: 0,id,label,tweet,word_count
0,1,0,@user when a father is dysfunctional and is s...,21
1,2,0,@user @user thanks for #lyft credit i can't us...,22
2,3,0,bihday your majesty,5
3,4,0,#model i love u take with u all the time in ...,17
4,5,0,factsguide: society now #motivation,8
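The word counts above can look too high (for example, 5 for "bihday your majesty"). A minimal sketch of why, assuming the raw tweet carries extra spaces: the sample string below is hypothetical, and `split()` with no argument is shown only for comparison, not as the method used in this notebook.

```python
# Hypothetical tweet text with two leading spaces (the exact spacing is an assumption)
tweet = "  bihday your majesty"

# split(" ") keeps an empty string for every extra space
count_with_space_arg = len(tweet.split(" "))

# split() with no argument collapses any run of whitespace
count_default = len(tweet.split())

print(count_with_space_arg)  # 5
print(count_default)         # 3
```

If the goal is the number of visible words, `split()` without an argument is usually the safer choice.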


In [3]:
# Number of characters in each tweet

df['char_count'] = df['tweet'].str.len()

df.head()

# Spaces are also counted

Unnamed: 0,id,label,tweet,word_count,char_count
0,1,0,@user when a father is dysfunctional and is s...,21,102
1,2,0,@user @user thanks for #lyft credit i can't us...,22,122
2,3,0,bihday your majesty,5,21
3,4,0,#model i love u take with u all the time in ...,17,86
4,5,0,factsguide: society now #motivation,8,39


In [8]:
# Average word length in each tweet

# Let's create a UDF (user-defined function) to find the average word length

def avg_words(x):
    words = x.split()
    # Guard against tweets with no words to avoid dividing by zero
    if not words:
        return 0.0
    return sum(len(y) for y in words) / len(words)

# Let's wrap the UDF in a lambda function

fun_avg_words = lambda z : avg_words(z)

# Let's apply the lambda function to the tweet column

df['avg_words_length'] = df['tweet'].apply(fun_avg_words)

df.head()

Unnamed: 0,id,label,tweet,word_count,char_count,avg_words_length
0,1,0,@user when a father is dysfunctional and is s...,21,102,4.555556
1,2,0,@user @user thanks for #lyft credit i can't us...,22,122,5.315789
2,3,0,bihday your majesty,5,21,5.666667
3,4,0,#model i love u take with u all the time in ...,17,86,4.928571
4,5,0,factsguide: society now #motivation,8,39,8.0


In [13]:
# Let's find how many stop words each tweet contains

from nltk.corpus import stopwords

# Requires the NLTK stop word corpus: nltk.download('stopwords')
stop = stopwords.words('english')

stop

# If you want to extend this list of stop words, add a few more words

stop.extend(['india', 'abcd'])


In [21]:
# Let's find how many stop words are in each tweet

fun_stop_words = lambda x : len([word for word in x.split() if word in stop])

df['stopwords_count'] = df['tweet'].apply(fun_stop_words)

df.head()

Unnamed: 0,id,label,tweet,word_count,char_count,avg_words_length,stopwords_count
0,1,0,@user when a father is dysfunctional and is s...,21,102,4.555556,10
1,2,0,@user @user thanks for #lyft credit i can't us...,22,122,5.315789,5
2,3,0,bihday your majesty,5,21,5.666667,1
3,4,0,#model i love u take with u all the time in ...,17,86,4.928571,5
4,5,0,factsguide: society now #motivation,8,39,8.0,1


Bag of words - This is the simplest analysis; it helps you find the words that are used most frequently across all the tweets. To create a bag of words, we must create a DTM (document-term matrix). Before building the DTM, we need to follow these steps:

1) Convert all the data to lower case
2) Remove special characters, punctuation, etc.
3) Remove the stop words
4) Bring the words to their root form - lemmatization
5) Spelling correction - However, this step takes a lot of time (depending on your data). Thus I will show you how it works on a small sample
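The first three steps can be sketched on a single string before applying them to the whole data frame. This is only an illustration under stated assumptions: `clean_tweet`, `STOP_WORDS`, and the sample text are hypothetical names and values, and the tiny stop word set stands in for the NLTK English list used in this notebook.

```python
import re

# Hypothetical, tiny stop word set for illustration only
STOP_WORDS = {"a", "is", "and", "for", "i", "the", "in", "they"}

def clean_tweet(text):
    """Lower-case the text, strip special characters, and drop stop words."""
    lowered = text.lower()                               # step 1: lower case
    letters_only = re.sub(r"[^a-z' ]", "", lowered)      # step 2: keep letters, apostrophes, spaces
    kept = [w for w in letters_only.split() if w not in STOP_WORDS]  # step 3: drop stop words
    return " ".join(kept)

sample = "@user Thanks for #lyft credit, I can't use it!"
cleaned = clean_tweet(sample)
print(cleaned)  # user thanks lyft credit can't use it
```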

In [25]:
# Step 1: Convert the data into lower case - str.lower()

df['lower_text'] = df['tweet'].str.lower()

# Let's keep only the tweet and lower_text columns

df1 = df[['tweet', 'lower_text']]

df1.head()

Unnamed: 0,tweet,lower_text
0,@user when a father is dysfunctional and is s...,@user when a father is dysfunctional and is s...
1,@user @user thanks for #lyft credit i can't us...,@user @user thanks for #lyft credit i can't us...
2,bihday your majesty,bihday your majesty
3,#model i love u take with u all the time in ...,#model i love u take with u all the time in ...
4,factsguide: society now #motivation,factsguide: society now #motivation


In [41]:
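# Step 2: Remove all the special characters and punctuation

# regex=True is required on pandas 2.0+, where str.replace treats the pattern literally by default
df1['clean_text'] = df1['lower_text'].str.replace("[^a-z' ]", '', regex=True)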

print(df1['clean_text'][1])
print(df1['tweet'][1])

user user thanks for lyft credit i can't use cause they don't offer wheelchair vans in pdx    disapointed getthanked
@user @user thanks for #lyft credit i can't use cause they don't offer wheelchair vans in pdx.    #disapointed #getthanked


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df1['clean_text'] = df1['lower_text'].str.replace("[^a-z' ]", '')
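The warning above appears because df1 was created by selecting columns from df, so pandas cannot tell whether the new column is meant for df1 or for df. A minimal sketch of one common way to avoid it, assuming an explicit `.copy()` at selection time is acceptable; the two-row frame below is hypothetical and stands in for the tweet data.

```python
import pandas as pd

# Hypothetical two-row frame standing in for the tweet data
df = pd.DataFrame({"tweet": ["@User Hello!", "#Python is fun."]})
df["lower_text"] = df["tweet"].str.lower()

# .copy() makes df1 an independent data frame rather than a slice of df
df1 = df[["tweet", "lower_text"]].copy()

# Adding a column now changes only df1
df1["clean_text"] = df1["lower_text"].str.replace("[^a-z' ]", "", regex=True)

print(df1["clean_text"].tolist())  # ['user hello', 'python is fun']
```

The original df is left untouched, which is usually what is intended when working on a subset of columns.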


In [44]:
# Step 3: Let's remove the stop words

# Create a UDF to remove the stop words

def sw(x):
    x = [word for word in x.split() if word not in stop]
    return " ".join(x)

df1['txt_non_stop'] = df1['clean_text'].apply(sw)

df1['txt_non_stop'][0]


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df1['txt_non_stop'] = df1['clean_text'].apply(sw)


'user father dysfunctional selfish drags kids dysfunction run'

Lemmatization is the process of converting a word to its root form. It is closely tied to part-of-speech tagging, because the root form of a word depends on its part of speech. For bag of words it may not be very important, but we will still look at this step.


In [47]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/amitchoudhary/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


True

In [65]:
# Step 4: Let me give a few examples of lemmatization

from nltk.stem import WordNetLemmatizer

# Create a lemmatizer object

lemma = WordNetLemmatizer()

# I can use this object to reduce any word to its root form

print(lemma.lemmatize('playing', pos = 'v'))
print(lemma.lemmatize('played', pos = 'v'))

play
play


NOTE: As lemmatization works with part of speech, it will not give any useful result if I apply it to my text without part-of-speech tagging. Thus, for now, we will not apply it to my text data.

In [82]:
# Step 5: Spelling correction - this step takes a lot of time, so I
# will only give you a demo of how you can use the textblob library for
# spelling correction

# You need to install the textblob library

# !pip install textblob

from textblob import TextBlob

# Let's create a dummy data frame

df2 = pd.DataFrame({"text": ['I was playing crcket. I miss play.',
                            'I was in dehradun in 2020']})

# I will create a lambda function from TextBlob

fun_spell = lambda x : str(TextBlob(x).correct())

# Let's apply the lambda function to df2

df2['correct_text'] = df2['text'].apply(fun_spell)

df2

Unnamed: 0,text,correct_text
0,I was playing crcket. I miss play.,I was playing cricket. I miss play.
1,I was in dehradun in 2020,I was in dehradun in 2020


In [86]:
# Step 6: Creating a DTM

# Step 6.1: Create a count vectorizer

count_vec = CountVectorizer()

# Step 6.2: Fit this object on the txt_non_stop column of df1
# (optional here, since fit_transform below fits again)

count_vec.fit(df1['txt_non_stop'])

# Step 6.3: Create the DTM by using fit_transform

X = count_vec.fit_transform(df1['txt_non_stop'])

X

<31962x39475 sparse matrix of type '<class 'numpy.int64'>'
	with 245806 stored elements in Compressed Sparse Row format>

Sparse matrix - The sparsity of a matrix is calculated as the number of cells containing zero values / total number of cells

Total cells = 31962 x 39475 = 1261699950

Cells containing non-zero values = 245806

Cells containing zero values = 1261699950 - 245806 = 1261454144

Sparsity = 1261454144 / 1261699950 = 0.9998 (approximately)

The sparsity of the matrix is as high as 99.98%. Such a matrix will not give me any meaningful information. This is because most words appear in only a few tweets, so it is very difficult to identify any pattern. Thus we must reduce the sparsity.
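The arithmetic above can be checked with a few lines of Python. This sketch simply reuses the numbers printed in the sparse matrix summary; the variable names are illustrative. For a SciPy sparse matrix X, the same values are available as `X.shape` and `X.nnz`.

```python
# Values taken from the sparse matrix summary printed above
n_rows, n_cols = 31962, 39475
stored = 245806  # cells containing non-zero values

total_cells = n_rows * n_cols
zero_cells = total_cells - stored
sparsity = zero_cells / total_cells

print("Total cells:", total_cells)
print("Zero cells: ", zero_cells)
print("Sparsity:   ", round(sparsity, 4))
```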

NOTE: We will work on sparsity reduction along with TF-IDF

In [98]:
# Let's find the top 20 words based on frequency

# Let's first convert the DTM into a data frame.
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out().
# Note: X.toarray() builds a dense matrix, which needs a lot of memory at this size.

DTM_DF = pd.DataFrame(X.toarray(), columns = count_vec.get_feature_names_out())

DTM_DF.head()

# Find the frequency of each word

word_freq = DTM_DF.sum()

word_freq

# Let's convert word_freq into a data frame

word_df = pd.DataFrame(word_freq).reset_index()

word_df

Unnamed: 0,index,0
0,aa,2
1,aaa,3
2,aaaaa,1
3,aaaaaand,1
4,aaaaah,1
...,...,...
39470,zydeco,2
39471,zz,1
39472,zzz,1
39473,zzzzzz,1


In [101]:
# Let's sort word_df from highest to lowest frequency

word_df.sort_values(by = 0, ascending = False).head(20)

Unnamed: 0,index,0
36660,user,17495
20451,love,2727
8351,day,2299
15079,happy,1691
1156,amp,1607
19796,life,1140
35063,time,1125
35225,today,1072
19893,like,1053
23694,new,987
