# Simple Text Mining concept and practice from scratch

This notebook walks through three different approaches to text mining, covering both the concept behind each one and its implementation in code.

## Introduction
Text mining here means finding the relationship between two words in a given sentence. The relationship can be measured using:
1) Frequency of appearance of two words
2) Statistical method of extracting connection
3) Word2vec (DL)

There are two prerequisite steps before performing text mining:
1) Select the target word
2) Choose the context: decide what the sentences are about

Some visualization tools for displaying text mining results are:
1) Gephi
2) Centrifuge
3) Commetrix

## 1) Frequency of appearance of two words

This method counts how often the target word appears together with other words in a given context. We assume that the higher the frequency, the stronger the correlation. We usually set a threshold and delete the word pairs that do not reach it.
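A minimal sketch of this counting and thresholding on toy data. The documents, the threshold value, and the variable names are assumptions for illustration and are not part of the notebook:

```python
from collections import Counter
from itertools import combinations

# Toy "documents" (hypothetical data, already tokenized and stop-word filtered)
docs = [
    ["film", "story", "great"],
    ["film", "story", "boring"],
    ["film", "great", "acting"],
]

pair_counts = Counter()
for tokens in docs:
    # Count each unordered pair once per document; sorting fixes the key order
    for a, b in combinations(sorted(set(tokens)), 2):
        pair_counts[(a, b)] += 1

threshold = 2  # hypothetical cutoff: keep pairs seen in at least 2 documents
frequent_pairs = {pair: n for pair, n in pair_counts.items() if n >= threshold}
print(frequent_pairs)
```

Only the pairs that co-occur in at least `threshold` documents survive, which is the filtering step described above.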

In [1]:
import pandas as pd
import glob
from afinn import Afinn # Wordlist-based approach for sentiment analysis.
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.tokenize import RegexpTokenizer
import numpy as np
import matplotlib.pyplot as plt
import os

##### Load the data

The data is available on Kaggle: [Link](https://www.kaggle.com/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews). It is a revised version of the IMDB Dataset of 50K Movie Reviews.

In [2]:
path = 'C:\\Users\\bokhy\\Desktop\\Python\\github\\Python-Projects'  

In [3]:
review = pd.read_csv(os.path.join(path, 'IMDB Dataset.csv'), engine="python")
review.head(10)

FileNotFoundError: [Errno 2] No such file or directory: 'C:\\Users\\bokhy\\Desktop\\Python\\github\\Python-Projects\\IMDB Dataset.csv'

In [10]:
# Keep only the positive reviews
is_pos_review = review['sentiment'] == 'positive'
pos_review = review[is_pos_review]['review'][0:100] # take the first 100 rows
pos_review.reset_index(inplace=True, drop=True) # reset the index
print(pos_review)

0     One of the other reviewers has mentioned that ...
1     A wonderful little production. <br /><br />The...
2     I thought this was a wonderful way to spend ti...
3     Petter Mattei's "Love in the Time of Money" is...
4     Probably my all-time favorite movie, a story o...
                            ...                        
95    I think this movie has got it all. It has real...
96    Howard (Kevin Kline) teaches English at the hi...
97    We usually think of the British as the experts...
98    One of Starewicz's longest and strangest short...
99    Nice character development in a pretty cool mi...
Name: review, Length: 100, dtype: object


In [11]:
tokenizer = RegexpTokenizer(r'[\w]+') # raw string avoids an invalid-escape warning
stop_words = stopwords.words('english')

count = {} # save the frequency of appearance as a dict here
for line in pos_review:
    words = line.lower() # lower case
    tokens = tokenizer.tokenize(words) # tokenize the words and save them
    stopped_tokens = [i for i in list(set(tokens)) if i not in stop_words+["br"]]
    stopped_tokens2 = [i for i in stopped_tokens if len(i)>1]
    for i,a in enumerate(stopped_tokens2): # index (i) and its corresponding value (a) in stopped_tokens2
        for b in stopped_tokens2[i+1:]:
            if a>b: # a and b are tokens, compared alphabetically so each pair is stored under a single key order
                count[b,a] = count.get((b,a),0) + 1
            else:
                count[a,b] = count.get((a,b),0) + 1

Dictionary tip: **get**
1) count.get((b,a)) : returns the value stored under the key (b,a) in count. If the dictionary has no key (b,a), it returns None.
2) count.get((b,a),0) : get accepts up to two arguments, and the second one is the default value. If (b,a) is among the keys, its value is returned; otherwise the default value 0 is returned.
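The behaviour of `get` described above can be checked with a short example. The pair used as a key is made up for illustration:

```python
count = {}
pair = ("film", "story")  # hypothetical key

print(count.get(pair))     # key is missing, so get returns None
print(count.get(pair, 0))  # key is missing, so get returns the default 0

# Typical counting idiom: read the current value (or 0) and add one
count[pair] = count.get(pair, 0) + 1
count[pair] = count.get(pair, 0) + 1
print(count)
```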

In [12]:
tokenizer.tokenize(pos_review[0].lower())[0:10]

['one',
 'of',
 'the',
 'other',
 'reviewers',
 'has',
 'mentioned',
 'that',
 'after',
 'watching']
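To make the tokenizing and filtering steps concrete, here is a sketch that uses only the standard library. It assumes that `re.findall` with the pattern `\w+` splits text the same way as the tokenizer above, and it uses a tiny hypothetical stop-word set instead of NLTK's English list:

```python
import re

# Hypothetical, tiny stop-word set; the notebook uses NLTK's English list instead
stop_words = {"of", "the", "has", "that", "a", "is"}

text = "One of the other reviewers has mentioned that<br /><br />the film is a gem"

tokens = re.findall(r"\w+", text.lower())  # runs of letters, digits, underscores
unique_tokens = set(tokens)                # keep one occurrence per review
kept = sorted(t for t in unique_tokens
              if t not in stop_words | {"br"} and len(t) > 1)
print(kept)
```

The `br` token comes from the HTML line breaks in the reviews, which is why it is filtered out together with the stop words.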

##### Change the dictionary to a dataframe

The result is a single-column dataframe: the index holds the tuples that were the dictionary keys, and the entries are the dictionary values.

In [13]:
df = pd.DataFrame.from_dict(count, orient='index')
df

| | 0 |
|---|---|
| (mess, surreal) | 1 |
| (may, mess) | 1 |
| (become, mess) | 1 |
| (get, mess) | 1 |
| (mess, side) | 1 |
| ... | ... |
| (slowly, taking) | 1 |
| (pretty, slowly) | 1 |
| (pound, taking) | 1 |
| (pound, pretty) | 1 |


Then we put the two co-occurring words and their frequency into a new dataframe.

In [14]:
list1=[]
for i in range(len(df)):
    list1.append([df.index[i][0], df.index[i][1], df[0].iloc[i]]) # .iloc makes the positional lookup explicit

df2 = pd.DataFrame(list1, columns=['term1','term2','freq'])
df3 = df2.sort_values(by=['freq'], ascending=False) 
df3 = df3.reset_index(drop=True)
df3.head(20)

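| | term1 | term2 | freq |
|---|---|---|---|
| 0 | film | one | 31 |
| 1 | like | movie | 24 |
| 2 | movie | one | 24 |
| 3 | film | like | 23 |
| 4 | film | story | 22 |
| 5 | movie | see | 22 |
| 6 | movie | time | 22 |
| 7 | one | time | 21 |
| 8 | like | one | 20 |
| 9 | movie | really | 20 |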
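The conversion from a tuple-keyed dictionary to sorted (term1, term2, freq) rows can also be sketched without the intermediate dataframe. The counts below are hypothetical values in the same shape as the `count` dictionary:

```python
# Hypothetical counts in the same shape as the `count` dictionary
count = {("like", "movie"): 24, ("film", "story"): 22, ("film", "one"): 31}

# Unpack each ((term1, term2), freq) item into one flat row
rows = [(term1, term2, freq) for (term1, term2), freq in count.items()]

# Sort by frequency, highest first
rows.sort(key=lambda row: row[2], reverse=True)
for row in rows:
    print(row)
# rows could then be passed to pd.DataFrame(rows, columns=['term1', 'term2', 'freq'])
```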


##### The same steps apply to the negative reviews as well

In [15]:
is_neg_review = review['sentiment'] == 'negative'
neg_review = review[is_neg_review]['review'][0:100] # take only 100 rows
neg_review.reset_index(inplace=True, drop=True) # reset the index

tokenizer = RegexpTokenizer(r'[\w]+') # raw string avoids an invalid-escape warning
stop_words = stopwords.words('english')

count = {} 
for line in neg_review:
    words = line.lower() 
    tokens = tokenizer.tokenize(words) 
    stopped_tokens = [i for i in list(set(tokens)) if not i in stop_words+["br"]]
    stopped_tokens2 = [i for i in stopped_tokens if len(i)>1]
    for i,a in enumerate(stopped_tokens2):
        for b in stopped_tokens2[i+1:]:
            if a>b:
                count[b,a] = count.get((b,a),0) + 1
            else:
                count[a,b] = count.get((a,b),0) + 1

In [16]:
df = pd.DataFrame.from_dict(count, orient='index')

list1=[]
for i in range(len(df)):
    list1.append([df.index[i][0], df.index[i][1], df[0].iloc[i]]) # .iloc makes the positional lookup explicit

df2 = pd.DataFrame(list1, columns=['term1','term2','freq'])
df3 = df2.sort_values(by=['freq'], ascending=False) # sort by freq in descending order
df3 = df3.reset_index(drop=True)
df3.head(20)

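| | term1 | term2 | freq |
|---|---|---|---|
| 0 | like | movie | 47 |
| 1 | film | movie | 37 |
| 2 | movie | one | 34 |
| 3 | film | like | 33 |
| 4 | good | movie | 32 |
| 5 | like | one | 31 |
| 6 | film | one | 31 |
| 7 | good | like | 29 |
| 8 | even | movie | 29 |
| 9 | movie | would | 28 |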


## 2) Statistical method of extracting connection

Here we use cosine similarity.

Similarity is a quantitative value, so the words must first be weighted before it can be computed. In other words, each word is represented by suitable numbers. TF-IDF is often used for this.
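As a rough illustration of TF-IDF weighting, the sketch below computes the weights by hand for a hypothetical toy corpus. It assumes the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization, which to my understanding matches the default settings of scikit-learn's `TfidfVectorizer`; the corpus and function name are made up:

```python
import math
from collections import Counter

# Hypothetical toy corpus, already tokenized
docs = [["good", "film"], ["good", "movie"], ["bad", "film", "film"]]
n_docs = len(docs)
vocab = sorted({w for d in docs for w in d})  # ['bad', 'film', 'good', 'movie']

# Document frequency: number of documents containing each word
df = Counter(w for d in docs for w in set(d))

def tfidf(doc):
    tf = Counter(doc)  # term frequency; missing words count as 0
    # Smoothed IDF (assumed formula): ln((1 + n) / (1 + df)) + 1
    raw = [tf[w] * (math.log((1 + n_docs) / (1 + df[w])) + 1) for w in vocab]
    norm = math.sqrt(sum(x * x for x in raw))
    return [x / norm for x in raw]  # L2-normalize each document vector

matrix = [tfidf(d) for d in docs]
for row in matrix:
    print([round(x, 3) for x in row])
```

Words that occur in fewer documents (such as "movie" here) receive a higher weight than words shared by many documents (such as "good").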

$$\mathrm{cosine\ similarity}\ S_{ij} = {A \cdot B \over ||A||\ ||B||} = {\sum_{k} x_{ik} \times x_{jk} \over \sqrt{\sum_{k} (x_{ik})^2} \times \sqrt{\sum_{k} (x_{jk})^2}}$$

$$\mathrm{jaccard\ similarity}\ S_{ij} = {\sum_k \mathrm{min}(x_{ik}, x_{jk}) \over \sum_k \mathrm{max}(x_{ik}, x_{jk})}$$

$$\mathrm{overlap\ similarity}\ S_{ij} = {\sum_k \mathrm{min}(x_{ik}, x_{jk}) \over \mathrm{min}(\sum_k x_{ik}, \sum_k x_{jk})}$$

Here $x$ is the frequency, $i, j$ are word indices, and $k$ is the sentence index.
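The three formulas can be checked on a small example. The sketch below implements them directly; the two frequency vectors are hypothetical and stand for the counts of two words across four sentences:

```python
import math

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))

def jaccard(x, y):
    return sum(min(a, b) for a, b in zip(x, y)) / sum(max(a, b) for a, b in zip(x, y))

def overlap(x, y):
    return sum(min(a, b) for a, b in zip(x, y)) / min(sum(x), sum(y))

# Hypothetical frequencies of word i and word j in sentences k = 0..3
x_i = [1, 0, 2, 1]
x_j = [1, 1, 0, 1]

print(cosine(x_i, x_j))   # 2 / (sqrt(6) * sqrt(3))
print(jaccard(x_i, x_j))  # 2 / 5
print(overlap(x_i, x_j))  # 2 / min(4, 3)
```

All three measures share the same idea (how much the two words appear in the same sentences) and differ only in how the shared part is normalized.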

In [17]:
import pandas as pd
import glob
from afinn import Afinn
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.tokenize import RegexpTokenizer
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from scipy import sparse

In [18]:
review = pd.read_csv(os.path.join(path, 'IMDB Dataset.csv'), engine="python")

# get positive 
is_pos_review = review['sentiment'] == 'positive'
pos_review = review[is_pos_review]['review'][0:100] 
pos_review.reset_index(inplace=True, drop=True) 

In [19]:
stop_words = stopwords.words('english')
vec = TfidfVectorizer(stop_words=stop_words)
vector_pos_review = vec.fit_transform(pos_review)
vector_pos_review

<100x4995 sparse matrix of type '<class 'numpy.float64'>'
	with 10713 stored elements in Compressed Sparse Row format>

In [20]:
# change sparse matrix to regular dataframe
A = vector_pos_review.toarray()
pd.DataFrame(A)

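| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 4985 | 4986 | 4987 | 4988 | 4989 | 4990 | 4991 | 4992 | 4993 | 4994 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.0 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.0 |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.0 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.0 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.146447 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 95 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.0 |
| 96 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.0 |
| 97 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.085182 | 0.0 |
| 98 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.072555 | 0.000000 | 0.0 |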


In [21]:
# transpose so that rows are words: cosine similarity then compares word-word instead of sentence-sentence
A=A.transpose()
pd.DataFrame(A)

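| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 90 | 91 | 92 | 93 | 94 | 95 | 96 | 97 | 98 | 99 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.0 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.0 |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.0 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.074655 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.074307 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.0 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4990 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.0 |
| 4991 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.0 |
| 4992 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.072555 | 0.0 |
| 4993 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.085182 | 0.000000 | 0.0 |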


In [22]:
# Convert it back to a sparse matrix
A_sparse = sparse.csr_matrix(A)
similarities_sparse = cosine_similarity(A_sparse, dense_output=False)
list(similarities_sparse.todok().items())[35000:35010]

[((1098, 133), 0.3484457223054042),
 ((1099, 133), 0.4726439154636967),
 ((1104, 133), 0.3382677891921974),
 ((1133, 133), 0.2738023270540216),
 ((1134, 133), 0.17092631488618495),
 ((1154, 133), 0.4726439154636967),
 ((1173, 133), 0.4726439154636967),
 ((1184, 133), 0.4726439154636967),
 ((1209, 133), 0.1523951677290328),
 ((1211, 133), 0.8812534988158323)]

In [23]:
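# get_feature_names() was removed in newer scikit-learn releases; get_feature_names_out() replaces it
print(vec.get_feature_names_out()[1098])
print(vec.get_feature_names_out()[133])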

dead
affected


In [24]:
vec.get_feature_names_out()[100:105].tolist() # tolist() converts the NumPy array to a plain list

['active', 'activities', 'actor', 'actors', 'actress']

In [25]:
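# build a dataframe sorted by similarity in descending order
df = pd.DataFrame(list(similarities_sparse.todok().items()), columns=['words', 'weight'])
df2 = df.sort_values(by=['weight'], ascending=False)
df2 = df2.reset_index(drop=True)
# np.round(weight) < 1 keeps only pairs with weight <= 0.5: this drops each word's self-similarity (weight 1), but also every pair above 0.5
df3 = df2.loc[np.round(df2['weight']) < 1]
df3 = df3.reset_index(drop=True)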

df3.head(10)

| | words | weight |
|---|---|---|
| 0 | (616, 1511) | 0.5 |
| 1 | (1511, 616) | 0.5 |
| 2 | (2929, 2082) | 0.499995 |
| 3 | (2082, 2929) | 0.499995 |
| 4 | (3483, 69) | 0.499987 |
| 5 | (4701, 3483) | 0.499987 |
| 6 | (3483, 1886) | 0.499987 |
| 7 | (2033, 3483) | 0.499987 |
| 8 | (4680, 3483) | 0.499987 |
| 9 | (3483, 4987) | 0.499987 |
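The result lists index pairs rather than words, and every pair appears twice, once as (i, j) and once as (j, i). A possible post-processing step, sketched here with a hypothetical vocabulary and hypothetical pairs, maps the indices back to words and keeps each pair only once:

```python
# Hypothetical vocabulary and similarity pairs, shaped like the output of the cell
feature_names = ["acting", "film", "movie", "story"]
pairs = [((1, 2), 0.5), ((2, 1), 0.5), ((0, 3), 0.4), ((3, 0), 0.4), ((1, 1), 1.0)]

seen = set()
named = []
for (i, j), weight in pairs:
    if i == j:
        continue                  # skip a word compared with itself
    key = (min(i, j), max(i, j))  # (i, j) and (j, i) describe the same pair
    if key in seen:
        continue
    seen.add(key)
    named.append((feature_names[key[0]], feature_names[key[1]], weight))

print(named)
```

In the notebook, the list returned by the vectorizer's feature names would play the role of `feature_names`.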


## 3) Word2vec (DL)

In [26]:
import pandas as pd
import glob
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.tokenize import RegexpTokenizer
import numpy as np
from gensim.models.word2vec import Word2Vec

The first few steps are the same as in the previous approaches.

In [27]:
review = pd.read_csv(os.path.join(path, 'IMDB Dataset.csv'), engine="python")

# get positive 
is_pos_review = review['sentiment'] == 'positive'
pos_review = review[is_pos_review]['review'][0:100] 
pos_review.reset_index(inplace=True, drop=True) 

In [28]:
tokenizer = RegexpTokenizer(r'[\w]+') # raw string avoids an invalid-escape warning
stop_words = stopwords.words('english')

text = [] 
for line in pos_review:
    words = line.lower() 
    tokens = tokenizer.tokenize(words) 
    # note: set() removes duplicates but also discards the original word order
    stopped_tokens = [i for i in list(set(tokens)) if i not in stop_words+["br"]]
    stopped_tokens2 = [i for i in stopped_tokens if len(i)>1]
    text.append(stopped_tokens2)

In [29]:
model = Word2Vec(text, 
                 sg=1, # enable skip-gram
                 window=2, # context window: at most 2 words on each side of the target word
                 min_count=3) # only use words that appear at least 3 times
model.init_sims(replace=True) # precompute normalized vectors (deprecated in gensim 4)
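To illustrate what `window=2` means for skip-gram, the sketch below lists the (target, context) pairs that a window of 2 produces for one toy sentence. It is a simplified illustration with a hypothetical helper function and a made-up sentence; gensim's actual training differs in its details:

```python
def skipgram_pairs(tokens, window=2):
    """Return (target, context) pairs within `window` positions of each target."""
    pairs = []
    for pos, target in enumerate(tokens):
        start = max(0, pos - window)
        stop = min(len(tokens), pos + window + 1)
        for ctx_pos in range(start, stop):
            if ctx_pos != pos:  # a word is not its own context
                pairs.append((target, tokens[ctx_pos]))
    return pairs

sentence = ["wonderful", "little", "production", "great", "acting"]  # toy example
for pair in skipgram_pairs(sentence, window=2):
    print(pair)
```

Skip-gram learns word vectors by trying to predict each context word from its target word, so words that share many contexts end up with similar vectors.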

In [31]:
# Similarity between 'film' and 'movie'
model.wv.similarity('film', 'movie')

0.70896316

In [32]:
# Top 5 words closest to the word "good"
model.wv.most_similar("good", topn=5)

[('time', 0.7345631122589111),
 ('work', 0.7224764823913574),
 ('well', 0.7197010517120361),
 ('movie', 0.7130800485610962),
 ('film', 0.6941635012626648)]