# NLP Notes:
Natural Language Processing (NLP) is a branch of Artificial Intelligence that analyzes, processes, and efficiently retrieves information from text data. NLP can be applied to a wide range of real-world problems, including document summarization, title generation, caption generation, fraud detection, speech recognition, recommendation systems, and machine translation.

## Import Common packages

In [1]:
import numpy as np
import pandas as pd
import re
import string
import math

## Import NLP related packages

In [2]:
#pip install contractions
import contractions
import nltk
#nltk.download('stopwords')
from nltk.corpus import stopwords
#nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer

## Import Data and Drop duplicates

In [3]:
df = pd.read_csv('C:/Users/NLP/data/media_group.csv')
# Drop duplicates
df.drop_duplicates(inplace=True)

In [4]:
df.head(10)

Unnamed: 0,focus_group_subtype,focus_group_subtype_id,doc_no_within_subtype,question_id,question_text,parent_num,parent_answer
0,media_group,3,1,2,how did your child use technology before the p...,5,My son goes to our charter school. Before the ...
1,media_group,3,1,2,how did your child use technology before the p...,1,It was pretty minimal for school. It was mostl...
2,media_group,3,1,2,how did your child use technology before the p...,4,My child also had access to the computer befor...
3,media_group,3,1,2,how did your child use technology before the p...,3,My son before the pandemic was mostly iPad for...
4,media_group,3,1,2,how did your child use technology before the p...,2,"My son's older through all these kids, but he'..."
5,media_group,3,1,2,how did your child use technology before the p...,1,"I don't know. It's been ... I mean, because sh..."
6,media_group,3,1,2,how did your child use technology before the p...,6,Okay. I have an 11-year-old boy who is in sixt...
7,media_group,3,1,3,what do you anticipate as people return to in ...,3,"Okay. Real quick. Both my kids, my son is ADHD..."
8,media_group,3,1,3,what do you anticipate as people return to in ...,1,"The return to school, I mean, right now, she's..."
9,media_group,3,1,3,what do you anticipate as people return to in ...,4,I would say my son's school is not going back ...


# Preprocessing of text Data
1. Expand contractions
2. Case handling
3. Remove punctuation
4. Remove words containing digits
5. Remove stopwords
6. Lemmatization
7. Remove extra spaces

#### 1. Expand contraction
A contraction is a shortened form of a word or phrase: don’t stands for do not, and aren’t stands for are not. Expanding these contractions makes the text data easier to analyze.

In [5]:
def expand_contraction(df,columns=[]):
    
    for col in columns:
        df[col] = df[col].apply(lambda text:contractions.fix(text))
        
    return df
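
To make the idea concrete without depending on the `contractions` package, here is a minimal hand-rolled sketch. `CONTRACTION_MAP` and `expand_simple` are hypothetical names introduced for illustration, and the lookup table is deliberately tiny; the library covers far more cases.

```python
import re

# Hypothetical, deliberately tiny lookup table; the real `contractions`
# package ships a far larger one.
CONTRACTION_MAP = {
    "don't": "do not",
    "aren't": "are not",
    "it's": "it is",
    "i'm": "i am",
}

# One alternation pattern built from the keys, matched case-insensitively.
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(k) for k in CONTRACTION_MAP) + r")\b",
    flags=re.IGNORECASE,
)

def expand_simple(text):
    """Replace every known contraction with its expanded form."""
    # Normalise the typographic apostrophe so "don’t" matches "don't".
    text = text.replace("\u2019", "'")
    return _PATTERN.sub(lambda m: CONTRACTION_MAP[m.group(0).lower()], text)

print(expand_simple("I don't know, they aren't here."))
```

Some contractions are ambiguous ("it's" can mean "it is" or "it has"), so a fixed table like this one is only an approximation.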

#### 2. Case handling
A machine treats upper-case and lower-case letters as different characters, so words like Ball and ball count as two distinct tokens. Converting all text to a single case avoids this problem, and lower case is the usual choice.

In [6]:
def case_handling(df,columns=[]):
    
    for col in columns:
        df[col] = df[col].str.lower() 
        
    return df       
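
The effect of case normalisation on the vocabulary can be seen in a few lines of plain Python. The word list is a made-up example.

```python
words = ["Ball", "ball", "BALL"]

# Without normalisation the three spellings count as three distinct tokens.
distinct_raw = set(words)

# After lowercasing they collapse into a single token.
distinct_lower = {w.lower() for w in words}

print(len(distinct_raw), len(distinct_lower))
```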

#### 3. Remove punctuations
Another common preprocessing step is removing punctuation. Python's `string.punctuation` lists the 32 ASCII punctuation characters, so we can combine the string module with a regular expression to replace every punctuation character in the text with an empty string.

In [7]:
def remove_punctuations(df,columns=[]):
    
    for col in columns:
        df[col] = df[col].apply(lambda text: re.sub('[%s]' % re.escape(string.punctuation), '' , text))
        
    return df   
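
As an alternative sketch of the same step, `str.translate` can delete the punctuation characters without a regular expression. `strip_punctuation` is a hypothetical helper name, not something the notebook defines.

```python
import string

# string.punctuation holds the 32 ASCII punctuation characters.
print(len(string.punctuation))

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def strip_punctuation(text):
    """Remove ASCII punctuation; non-ASCII marks such as ’ are untouched."""
    return text.translate(_PUNCT_TABLE)

print(strip_punctuation("Okay. Real quick: both my kids, my son..."))
```

Deleting punctuation outright glues together words that were joined by it, which is where tokens such as `11yearold` and `horribleit` in the processed output come from. Replacing punctuation with a space instead is one way to avoid that, at the cost of splitting hyphenated words.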

#### 4. Remove words containing digits
Text sometimes contains tokens that mix letters and digits, such as game57 or game5ts7. These tokens are hard for a model to interpret, so it is usually better to remove them, that is, to replace them with an empty string. We use a regular expression for this.

In [8]:
def remove_words_dgits(df,columns=[]):
    
    for col in columns:
        # \w*\d\w* matches any run of word characters that contains a digit
        df[col] = df[col].apply(lambda text: re.sub(r'\w*\d\w*', '', text))
        
    return df
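
A quick standalone check of a digit-matching pattern on the examples from the text. `DIGIT_WORD` and `drop_digit_words` are illustrative names; the extra whitespace clean-up is an addition of this sketch.

```python
import re

# \w*\d\w* matches any run of word characters that contains at least one digit.
DIGIT_WORD = re.compile(r"\w*\d\w*")

def drop_digit_words(text):
    """Delete tokens containing a digit, then tidy the leftover spaces."""
    without = DIGIT_WORD.sub("", text)
    return " ".join(without.split())

print(drop_digit_words("game57 was fun but game5ts7 crashed in 2020"))
```

Note that pure numbers such as `2020` are removed as well, which may or may not be desirable for a given dataset.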

#### 5. Remove stopwords
Stopwords are the most frequently occurring words in a text, and they carry little useful information. Examples include they, there, this, and where. NLTK is a widely used library for this step; its English list contains approximately 180 stopwords. Because we load the list into a Python set, adding a new word is easy with the set's add method.

In [9]:
def remove_stopwords(df, columns=[]):
    
    stop_words = set(stopwords.words('english'))
    
    def remove_sw(text):
        txt_output = " ".join([word for word in str(text).split() if word not in stop_words])
        return txt_output
    
    for col in columns:
        df[col] = df[col].apply(lambda text: remove_sw(text))
    
    return df
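
The filtering logic, and the `add` method mentioned above, can be tried without downloading the NLTK corpus. The stop list here is a hypothetical miniature, and `drop_stopwords` is an illustrative name.

```python
# Hypothetical mini stop list; NLTK's stopwords.words('english') is much longer.
stop_words = {"they", "there", "this", "where", "is", "the", "a"}

# Domain-specific additions go in with set.add (or set.update for several).
stop_words.add("okay")

def drop_stopwords(text, stop_words):
    """Keep only the tokens that are not in the stop list."""
    return " ".join(w for w in text.split() if w not in stop_words)

print(drop_stopwords("okay this is the school where they learn", stop_words))
```

Membership tests on a set are constant time on average, which is why the notebook wraps the NLTK list in `set(...)` before filtering.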

#### 6. Lemmatization
Lemmatization, like stemming, reduces words to a root form, but it works differently. Instead of chopping off suffixes, lemmatization maps each word to its lemma by looking it up in a language dictionary (WordNet, in the code below).

In [10]:
def lemmatize_words(df, columns=[]):
    
    lemmatizer = WordNetLemmatizer()
    
    def lemmatize(text):
        text_output = " ".join([lemmatizer.lemmatize(word) for word in text.split()])
        return text_output
    
    for col in columns:
        df[col] = df[col].apply(lambda text: lemmatize(text))
        
    return df
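
The difference between suffix stripping and dictionary lookup can be illustrated with a toy example. Both functions and the mini dictionary are hypothetical stand-ins; they are not how NLTK's stemmers or `WordNetLemmatizer` are implemented.

```python
# Hypothetical mini dictionary; WordNet plays this role in the real lemmatizer.
LEMMA_DICT = {"children": "child", "went": "go", "studies": "study", "kids": "kid"}

def crude_stem(word):
    """Naive suffix stripping, roughly what a stemmer does."""
    for suffix in ("ies", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def toy_lemmatize(word):
    """Dictionary lookup; unknown words are returned unchanged."""
    return LEMMA_DICT.get(word, word)

for w in ["studies", "children", "went", "kids"]:
    print(w, crude_stem(w), toy_lemmatize(w))
```

NLTK's `lemmatize` treats every word as a noun unless a `pos` argument is passed, which is why verb forms such as "going" and "using" survive in the processed output.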

#### 7. Remove Extra Spaces 
Text data often contains extra spaces, and the preprocessing steps above can leave more than one space between words. A regular expression that collapses runs of spaces into a single space solves this.

In [11]:
def remove_extra_spaces(df,columns=[]):
    
    for col in columns:
        df[col] = df[col].apply(lambda text: re.sub(' +', ' ', text))
        
    return df 
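
A slightly broader variant, sketched here as an assumption rather than as the notebook's method, collapses any whitespace (tabs and newlines included) and also trims the ends. `squeeze_whitespace` is an illustrative name.

```python
import re

def squeeze_whitespace(text):
    """Collapse runs of any whitespace to one space and trim both ends."""
    return re.sub(r"\s+", " ", text).strip()

print(repr(squeeze_whitespace("  son   pandemic\tmostly \n ipad ")))
```

The `' +'` pattern used in the notebook only touches literal space characters and leaves a single leading or trailing space in place.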

# Data preprocessing entry point

In [12]:
def data_preprocessing(df, columns=[]):
    
    df = expand_contraction(df,columns)
    df = case_handling(df,columns) 
    df = remove_punctuations(df,columns)
    #df = remove_words_dgits(df,columns)  
    df = remove_stopwords(df,columns) 
    df = lemmatize_words(df, columns)
    df = remove_extra_spaces(df,columns) 
    
    return df

In [13]:
columns=['parent_answer', 'question_text']
output_df =  data_preprocessing(df, columns)

In [14]:
output_df.head(10)

Unnamed: 0,focus_group_subtype,focus_group_subtype_id,doc_no_within_subtype,question_id,question_text,parent_num,parent_answer
0,media_group,3,1,2,child use technology pandemic educational purpose,5,son go charter school pandemic using computer ...
1,media_group,3,1,2,child use technology pandemic educational purpose,1,pretty minimal school mostly mean use chromebo...
2,media_group,3,1,2,child use technology pandemic educational purpose,4,child also access computer pandemic school als...
3,media_group,3,1,2,child use technology pandemic educational purpose,3,son pandemic mostly ipad accommodation would u...
4,media_group,3,1,2,child use technology pandemic educational purpose,2,son older kid still trying get diploma high sc...
5,media_group,3,1,2,child use technology pandemic educational purpose,1,know mean remote certain time mean horribleit ...
6,media_group,3,1,2,child use technology pandemic educational purpose,6,okay 11yearold boy sixth grade sevenyearold gi...
7,media_group,3,1,3,anticipate people return person concern anythi...,3,okay real quick kid son adhd asd daughter adhd...
8,media_group,3,1,3,anticipate people return person concern anythi...,1,return school mean right three day one week tw...
9,media_group,3,1,3,anticipate people return person concern anythi...,4,would say son school going back session going ...


# TF-IDF Vectorizer vs TF-IDF Transformer
TF-IDF is a score assigned to every word in every document of the dataset. A word's TF-IDF value rises with each occurrence of the word in a document and falls as the word appears in more of the other documents.

In theory the two implementations produce the same result; in practice TfidfTransformer takes a little more code. TfidfVectorizer computes both term frequency and inverse document frequency for you, whereas TfidfTransformer expects term counts as input, so you first run Scikit-Learn's CountVectorizer to obtain them.
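
To show what the score actually computes, the sketch below works it out by hand on a made-up three-document corpus. It assumes the smoothed formula that scikit-learn documents as its default, idf = ln((1 + n) / (1 + df)) + 1, followed by L2 normalisation of each document vector; the corpus and the `tfidf` helper are illustrative only.

```python
import math
from collections import Counter

docs = [
    "child use computer school",
    "child use ipad",
    "school computer computer",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents each term occurs.
df = Counter(term for doc in tokenized for term in set(doc))

def tfidf(doc):
    """Return {term: weight} for one tokenized document, L2-normalised."""
    tf = Counter(doc)  # raw term counts
    raw = {
        t: count * (math.log((1 + n_docs) / (1 + df[t])) + 1)
        for t, count in tf.items()
    }
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {t: v / norm for t, v in raw.items()}

weights = tfidf(tokenized[1])
for term, w in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{term:10s}{w:.3f}")
```

In the second document, "ipad" gets the highest weight because it occurs in no other document, while "child" and "use" are shared with the first document and are weighted down.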

In [15]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [16]:
corpus = output_df['parent_answer'].tolist()
tdidf_vectorizer = TfidfVectorizer(use_idf=True)
tfIdf = tdidf_vectorizer.fit_transform(corpus)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
tfidf_df = pd.DataFrame(tfIdf[0].T.todense(), index=tdidf_vectorizer.get_feature_names_out(), columns=["TF-IDF"])
tfidf_df = tfidf_df.sort_values("TF-IDF", ascending=False)

In [17]:
tfidf_df.head(10)

Unnamed: 0,TF-IDF
understood,0.370622
book,0.282573
access,0.259517
able,0.238669
always,0.19019
framework,0.185311
britannica,0.185311
search,0.185311
beneficial,0.185311
encyclopedia,0.185311


# Unsupervised Sentiment Analysis

In [18]:
import multiprocessing

#pip install gensim
from gensim.models.phrases import Phrases, Phraser
from gensim.models import Word2Vec
from gensim.test.utils import get_tmpfile
from gensim.models import KeyedVectors


from time import time 
from collections import defaultdict

In [19]:
output_df['parent_answer'] =  output_df['parent_answer'].apply(lambda text: text.split())

In [21]:
sent = [row for row in output_df.parent_answer]
phrases = Phrases(sent, min_count=1, progress_per=50000)
bigram = Phraser(phrases)
sentences = bigram[sent]

In [22]:
w2v_model = Word2Vec(min_count=3,
                     window=4,
                     #size=300,
                     sample=1e-5, 
                     alpha=0.03, 
                     min_alpha=0.0007, 
                     negative=20,
                     workers=multiprocessing.cpu_count()-1)

start = time()

w2v_model.build_vocab(sentences, progress_per=50000)

print('Time to build vocab: {} mins'.format(round((time() - start) / 60, 2)))

Time to build vocab: 0.0 mins


In [23]:
start = time()

w2v_model.train(sentences, total_examples=w2v_model.corpus_count, epochs=30, report_delay=1)

print('Time to train the model: {} mins'.format(round((time() - start) / 60, 2)))

w2v_model.init_sims(replace=True)

Time to train the model: 0.01 mins




In [24]:
w2v_model.save("word2vec.model")

In [25]:
file_export = output_df.copy()
file_export['old_parent_answer'] = file_export.parent_answer
file_export.old_parent_answer = file_export.old_parent_answer.str.join(' ')
file_export.parent_answer = file_export.parent_answer.apply(lambda x: ' '.join(bigram[x]))
#file_export.rate = file_export.rate.astype('int8')

In [26]:
file_export[['parent_answer']].to_csv('cleaned_dataset.csv', index=False)

In [27]:
file_export.head()

Unnamed: 0,focus_group_subtype,focus_group_subtype_id,doc_no_within_subtype,question_id,question_text,parent_num,parent_answer,old_parent_answer
0,media_group,3,1,2,child use technology pandemic educational purpose,5,son go_charter school pandemic using_computer ...,son go charter school pandemic using computer ...
1,media_group,3,1,2,child use technology pandemic educational purpose,1,pretty minimal school mostly mean use chromebo...,pretty minimal school mostly mean use chromebo...
2,media_group,3,1,2,child use technology pandemic educational purpose,4,child also access computer pandemic school als...,child also access computer pandemic school als...
3,media_group,3,1,2,child use technology pandemic educational purpose,3,son pandemic mostly ipad accommodation would u...,son pandemic mostly ipad accommodation would u...
4,media_group,3,1,2,child use technology pandemic educational purpose,2,son_older kid still trying_get diploma high_sc...,son older kid still trying get diploma high sc...


# KMeans clustering

In [28]:
from sklearn.cluster import KMeans

In [29]:
word_vectors = Word2Vec.load("word2vec.model").wv

In [30]:
# random_state expects an integer seed; 1 seeds the generator exactly as True did
model = KMeans(n_clusters=2, max_iter=1000, random_state=1, n_init=50).fit(X=word_vectors.vectors.astype('double'))

In [31]:
word_vectors.similar_by_vector(model.cluster_centers_[1], topn=10, restrict_vocab=None)

[('like', 0.9999430179595947),
 ('going', 0.9999169111251831),
 ('thing', 0.9999097585678101),
 ('lot', 0.9999040365219116),
 ('kid', 0.999903678894043),
 ('day', 0.9999033808708191),
 ('class', 0.9998947978019714),
 ('something', 0.9998836517333984),
 ('time', 0.9998831748962402),
 ('really', 0.9998794794082642)]

In [32]:
positive_cluster_index = 1
positive_cluster_center = model.cluster_centers_[positive_cluster_index]
negative_cluster_center = model.cluster_centers_[1-positive_cluster_index]

In [33]:
words = pd.DataFrame(word_vectors.index_to_key)
words.columns = ['words']
words['vectors'] = words.words.apply(lambda x: word_vectors[f'{x}'])
words['cluster'] = words.vectors.apply(lambda x: model.predict([np.array(x)]))
words.cluster = words.cluster.apply(lambda x: x[0])

In [34]:
words['cluster_value'] = [1 if i==positive_cluster_index else -1 for i in words.cluster]
words['closeness_score'] = words.apply(lambda x: 1/(model.transform([x.vectors]).min()), axis=1)
words['sentiment_coeff'] = words.closeness_score * words.cluster_value
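
The scoring in the cell above boils down to a signed inverse distance to the nearest cluster centre. The following sketch reproduces that arithmetic with made-up 2-D centroids and word vectors so the numbers are easy to follow; none of the values come from the trained model.

```python
import math

# Hypothetical 2-D centroids and word vectors, for illustration only.
centroids = [(0.0, 0.0), (4.0, 0.0)]
positive_cluster_index = 1
toy_vectors = {"great": (3.5, 0.0), "fine": (2.5, 0.0), "awful": (0.5, 0.0)}

def sentiment_coeff(vec):
    """Signed inverse distance to the nearest centroid."""
    distances = [math.dist(vec, c) for c in centroids]
    cluster = distances.index(min(distances))          # nearest centroid
    sign = 1 if cluster == positive_cluster_index else -1
    return sign * (1 / min(distances))                 # closer => larger magnitude

for word, vec in toy_vectors.items():
    print(word, sentiment_coeff(vec))
```

A vector lying exactly on a centroid would cause a division by zero, so a small epsilon in the denominator may be worth adding in practice.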

In [35]:
words.head(10)

Unnamed: 0,words,vectors,cluster,cluster_value,closeness_score,sentiment_coeff
0,like,"[-0.087506786, 0.06588098, 0.08399521, 0.04410...",1,1,93.324066,93.324066
1,thing,"[-0.089280754, 0.06674363, 0.084365815, 0.0435...",1,1,74.336956,74.336956
2,school,"[-0.08718752, 0.06679961, 0.08080908, 0.041982...",1,1,59.851661,59.851661
3,going,"[-0.089307986, 0.06810554, 0.083315134, 0.0419...",1,1,77.387419,77.387419
4,kid,"[-0.08896042, 0.06643233, 0.081441745, 0.04171...",1,1,71.990519,71.990519
5,know,"[-0.08963193, 0.06641954, 0.08263298, 0.039908...",1,1,61.301438,61.301438
6,really,"[-0.084751, 0.06429121, 0.082660295, 0.0423803...",1,1,64.353029,64.353029
7,go,"[-0.08496199, 0.0647335, 0.08556023, 0.0445570...",1,1,62.249215,62.249215
8,time,"[-0.09002568, 0.06858209, 0.08407999, 0.044753...",1,1,65.31995,65.31995
9,think,"[-0.08879573, 0.06407977, 0.080693446, 0.04480...",1,1,62.305008,62.305008


In [36]:
words[['words', 'sentiment_coeff']].to_csv('sentiment_dictionary.csv', index=False)

# Predictions

In [37]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score, f1_score
from IPython.display import display

In [38]:
final_file = pd.read_csv('cleaned_dataset.csv')

In [39]:
sentiment_map = pd.read_csv('sentiment_dictionary.csv')
sentiment_dict = dict(zip(sentiment_map.words.values, sentiment_map.sentiment_coeff.values))

In [40]:
file_weighting = final_file.copy()

In [41]:
tfidf = TfidfVectorizer(tokenizer=lambda y: y.split(), norm=None)
tfidf.fit(file_weighting.parent_answer)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
features = pd.Series(tfidf.get_feature_names_out())
transformed = tfidf.transform(file_weighting.parent_answer)



In [42]:
def create_tfidf_dictionary(x, transformed_file, features):
    '''
    create dictionary for each input sentence x, where each word has assigned its tfidf score
    
    inspired  by function from this wonderful article: 
    https://medium.com/analytics-vidhya/automated-keyword-extraction-from-articles-using-nlp-bfd864f41b34
    
    x - row of dataframe, containing sentences, and their indexes,
    transformed_file - all sentences transformed with TfidfVectorizer
    features - names of all words in corpus used in TfidfVectorizer

    '''
    vector_coo = transformed_file[x.name].tocoo()
    vector_coo.col = features.iloc[vector_coo.col].values
    dict_from_coo = dict(zip(vector_coo.col, vector_coo.data))
    return dict_from_coo

def replace_tfidf_words(x, transformed_file, features):
    '''
    replace each word of the sentence with its TF-IDF score, looked up in the sentence's TF-IDF dictionary
    x - row of dataframe, containing sentences, and their indexes,
    transformed_file - all sentences transformed with TfidfVectorizer
    features - names of all words in corpus used in TfidfVectorizer
    '''
    dictionary = create_tfidf_dictionary(x, transformed_file, features)   
    return list(map(lambda y:dictionary[f'{y}'], x.parent_answer.split()))

In [43]:
%%time
replaced_tfidf_scores = file_weighting.apply(lambda x: replace_tfidf_words(x, transformed, features), axis=1)
# this step takes around 3-4 minutes to calculate

Wall time: 16.2 ms


In [44]:
def replace_sentiment_words(word, sentiment_dict):
    '''
    replacing each word with its associated sentiment score from sentiment dict
    '''
    try:
        out = sentiment_dict[word]
    except KeyError:
        out = 0
    return out
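
The same fallback behaviour can be written more compactly with `dict.get`. The scores and the helper name `lookup_sentiment` below are hypothetical.

```python
sentiment_dict = {"school": 59.85, "kid": 71.99}  # hypothetical scores

def lookup_sentiment(word, sentiment_dict):
    """Return the word's score, or 0 when the word is not in the dictionary."""
    return sentiment_dict.get(word, 0)

scores = [lookup_sentiment(w, sentiment_dict) for w in "kid go_charter school".split()]
print(scores)
```

Words fall back to 0 when they are missing from the Word2Vec vocabulary, for example rare words and bigrams dropped by `min_count=3`, so they contribute nothing to the sentence score.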

In [45]:
replaced_closeness_scores = file_weighting.parent_answer.apply(lambda x: list(map(lambda y: replace_sentiment_words(y, sentiment_dict), x.split())))

In [46]:
replacement_df = pd.DataFrame(data=[replaced_closeness_scores, replaced_tfidf_scores, file_weighting.parent_answer]).T
replacement_df.columns = ['sentiment_coeff', 'tfidf_scores', 'parent_answer']
replacement_df['sentiment_rate'] = replacement_df.apply(lambda x: np.array(x.loc['sentiment_coeff']) @ np.array(x.loc['tfidf_scores']), axis=1)
replacement_df['prediction'] = (replacement_df.sentiment_rate>0).astype('int8')
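
The `sentiment_rate` column is a dot product of the two per-word lists, and the prediction is its sign. A minimal sketch with invented values for a three-word answer:

```python
# Hypothetical per-word values for a three-word answer.
sentiment_coeff = [51.2, 0.0, -12.5]   # signed closeness to a cluster centre
tfidf_scores = [2.2, 3.7, 4.0]         # weight of each word in this answer

# Dot product: every word's sentiment weighted by its TF-IDF score.
sentiment_rate = sum(s * t for s, t in zip(sentiment_coeff, tfidf_scores))

# Positive total => class 1, otherwise class 0.
prediction = int(sentiment_rate > 0)

print(round(sentiment_rate, 2), prediction)
```

Because the score is a sum over words, a prediction of 0 requires enough negatively signed words to outweigh the positive ones; in the rows shown in the output, every visible coefficient is non-negative, so every prediction is 1.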

In [47]:
# pd.set_option('display.max_columns', None)  # or 1000
# pd.set_option('display.max_rows', None)  # or 1000
# pd.set_option('display.max_colwidth', None)  # or 199; -1 is no longer accepted by recent pandas

In [48]:
final_df = replacement_df
final_df.head(10)

Unnamed: 0,sentiment_coeff,tfidf_scores,parent_answer,sentiment_rate,prediction
0,"[51.16967823000269, 12.150041364467253, 59.851...","[2.2264456601779945, 3.70805020110221, 2.97769...",son go_charter school pandemic using_computer ...,5442.709665,1
1,"[25.651072669700238, 11.05353918420877, 59.851...","[3.5257286443082556, 8.437751649736402, 4.4665...",pretty minimal school mostly mean use chromebo...,3627.236638,1
2,"[38.258617769129565, 51.08810571118995, 35.694...","[2.83258146374831, 10.909969488035802, 6.47609...",child also access computer pandemic school als...,13693.814362,1
3,"[51.16967823000269, 29.082246264894614, 19.889...","[2.2264456601779945, 4.545931351625775, 3.3715...",son pandemic mostly ipad accommodation would u...,9949.746402,1
4,"[0, 71.99051860639777, 47.754613623606055, 24....","[4.218875824868201, 3.2078320936640052, 2.1819...",son_older kid still trying_get diploma high_sc...,11418.01884,1
5,"[61.30143787442263, 56.63963855498419, 0, 65.3...","[1.556287997842748, 5.955850810083319, 4.21887...",know mean remote_certain time mean_horribleit ...,4040.034258,1
6,"[25.5662280334024, 27.615178486745155, 27.1838...","[3.0149030205422647, 3.70805020110221, 3.70805...",okay 11yearold boy sixth_grade sevenyearold gi...,34266.244464,1
7,"[25.5662280334024, 16.73378992578032, 71.99051...","[3.0149030205422647, 3.9311937524164198, 3.207...",okay real_quick kid son_adhd asd daughter adhd...,19929.738922,1
8,"[0, 59.85166100302707, 56.63963855498419, 31.5...","[4.624340932976365, 13.399620453424939, 3.9705...",return school mean right three_day one week_tw...,32266.332149,1
9,"[44.399703506099776, 51.16967823000269, 59.851...","[2.6094379124341005, 2.2264456601779945, 10.42...",would_say son school going_back session going ...,37961.217897,1
