# Text-Mining Tasks

The notebook contains the following text-mining functions:
1. Spelling Recommender
2. Document Similarity Detector
3. Topic Detector
4. Spam Detector
5. Date-Parser

## 1. Spelling Recommender

This spelling recommender takes a misspelled word and recommends a correctly spelled one. Among the words in correct_spellings (the NLTK words corpus) that start with the same letter as the misspelled word, it returns the one with the smallest Jaccard distance, computed on character trigrams.

In [1]:
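def spell_right(misspelled_word):
    import nltk
    from nltk.corpus import words
    correct_spellings = words.words()
    #keep only candidates that start with the same letter as the misspelled word
    first_only = [i for i in correct_spellings if i[0] == misspelled_word[0]]
    #Jaccard distance between the character trigrams of the misspelled word and of each candidate
    misspelled_trigrams = set(nltk.ngrams(misspelled_word, n=3))
    dist = [(nltk.jaccard_distance(misspelled_trigrams, set(nltk.ngrams(a, n=3))), a) for a in first_only]
    #return the candidate with the smallest distance (ties are broken by string order)
    return sorted(dist)[0][1]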

In [2]:
#test 1
spell_right("bnana")

'banana'

In [3]:
#test 2
spell_right("cocnut")

'coconut'
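
To make the ranking concrete, the Jaccard distance on character trigrams can be reproduced without NLTK. The following is a minimal sketch: the helper names char_trigrams and jaccard_distance are illustrative and not part of the notebook, and 'bandana' is just an example of a competing candidate.

```python
def char_trigrams(word):
    """Return the set of character trigrams of a word."""
    return {word[i:i + 3] for i in range(len(word) - 2)}


def jaccard_distance(a, b):
    """1 - |a & b| / |a | b|; defined here as 0.0 when both sets are empty."""
    union = a | b
    if not union:
        return 0.0
    return 1 - len(a & b) / len(union)


misspelled = char_trigrams("bnana")  # {'bna', 'nan', 'ana'}
d_banana = jaccard_distance(misspelled, char_trigrams("banana"))
d_bandana = jaccard_distance(misspelled, char_trigrams("bandana"))
print(d_banana, d_bandana)
```

'banana' shares two of four distinct trigrams with 'bnana', while 'bandana' shares only one of seven, so 'banana' has the smaller distance and ranks first.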

## 2. Document Similarity Detector

Computes the symmetric path similarity between two documents by
- finding the WordNet synsets for the tokens of each document, after converting the part-of-speech tags to the WordNet format
- computing the similarities between the synsets using path_similarity

In [4]:
def document_path_similarity(doc1, doc2):
    import nltk
    from nltk.corpus import wordnet as wn
    from nltk import pos_tag

    def doc_to_synsets(string):
        tokens= nltk.word_tokenize(string) 
        tags = nltk.pos_tag(tokens)
        wordnet_tags = []
        def convert_tag(tag): 
            tag_dict = {'N': 'n', 'J': 'a', 'R': 'r', 'V': 'v'}
            try:
                return tag_dict[tag[0]]
            except KeyError:
                return None
        for tag in tags:
            try:
                wordnet_tags.append([tag[0] ,convert_tag(tag[1])])
            except KeyError:
                pass
        s1 = []
        for tag in wordnet_tags:
            try:
                s1.append(wn.synsets(tag[0], pos=tag[1])[0]) 
            except Exception as err:
                pass
        return s1
    synsets1 = doc_to_synsets(doc1)
    synsets2 = doc_to_synsets(doc2)
    def similarity_score(s1, s2):
        max_sims=[]
        for i in s1:
            #keep only similarities that are neither None nor 0
            sims = [i.path_similarity(y) for y in s2]
            sims = [x for x in sims if x]
            if sims:
                max_sims.append(max(sims))
        #average of the best matches; 0 if no synset could be matched
        if not max_sims:
            return 0
        return sum(max_sims)/len(max_sims)
    return (similarity_score(synsets1, synsets2) + similarity_score(synsets2, synsets1)) / 2


In [5]:
#test
doc1 = 'This is a function and we want to know if it correctly finds out the path similarity.'
doc2 = 'I will use this function to check if the code is correct and detects the path similarity'
document_path_similarity(doc1, doc2)
    

0.5485141093474428
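
The scoring scheme itself does not depend on WordNet: for each item in one document, take its best similarity to any item in the other document, average those maxima, and then average the two directions. Here is a minimal sketch of that idea, assuming a made-up similarity table (TOY_SIMILARITY and toy_sim are hypothetical stand-ins for path_similarity).

```python
def directional_score(items_a, items_b, sim):
    """Average, over items_a, of the best non-zero similarity to any item in items_b."""
    best = []
    for a in items_a:
        scores = [s for s in (sim(a, b) for b in items_b) if s]
        if scores:  # items without any match are skipped
            best.append(max(scores))
    return sum(best) / len(best) if best else 0.0


def symmetric_score(items_a, items_b, sim):
    """Mean of the two directional scores, so the argument order does not matter."""
    return (directional_score(items_a, items_b, sim)
            + directional_score(items_b, items_a, sim)) / 2


# Hypothetical similarity table standing in for WordNet's path_similarity.
TOY_SIMILARITY = {frozenset(("dog", "cat")): 0.2, frozenset(("dog", "pet")): 0.5}


def toy_sim(a, b):
    if a == b:
        return 1.0
    return TOY_SIMILARITY.get(frozenset((a, b)))  # None when the pair is unrelated


doc_a = ["dog", "run"]
doc_b = ["cat", "run", "pet"]
forward = directional_score(doc_a, doc_b, toy_sim)
backward = directional_score(doc_b, doc_a, toy_sim)
score = symmetric_score(doc_a, doc_b, toy_sim)
print(forward, backward, score)
```

The two directional scores differ because each one averages over a different document, which is why the final score averages both.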

## 3. Topic Detector

In [6]:
#importing libraries
import pickle
import gensim
from sklearn.feature_extraction.text import TfidfVectorizer



In [7]:
#loading data-file from newsgroups using pickle
with open('newsgroups', 'rb') as f: 
    newsgroup_data = pickle.load(f)

In [11]:
#preprocessing the data using the TfidfVectorizer
#keeping tokens that occur in at most 20% of the docs and in at least 20 docs, ignoring stopwords
#only taking into account tokens with at least 4 word characters (letters, digits or underscore)
vect = TfidfVectorizer(min_df=20, max_df=0.2, stop_words='english', 
                       token_pattern='(?u)\\b\\w\\w\\w\\w+\\b') 

X = vect.fit_transform(newsgroup_data)
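
The effect of the token_pattern can be checked in isolation with Python's re module. This sketch assumes the vectorizer's default lowercasing and ignores stop-word removal and the min_df/max_df filters. The sample sentence is made up.

```python
import re

# Same pattern as passed to TfidfVectorizer above: three \w followed by \w+,
# i.e. tokens of at least four word characters.
TOKEN_PATTERN = r'(?u)\b\w\w\w\w+\b'


def tokenize(text):
    """Lowercase the text and return all tokens matching TOKEN_PATTERN."""
    return re.findall(TOKEN_PATTERN, text.lower())


sample_tokens = tokenize("The cat sat on a warm sunny mat in 2009")
print(sample_tokens)
```

Three-letter words such as "the" and "cat" are dropped, while digit-only tokens such as "2009" are kept because \w also matches digits.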

In [12]:
#building the lda model

#converting the sparse tf-idf matrix into a gensim corpus (rows are documents)
corpus = gensim.matutils.Sparse2Corpus(X, documents_columns=False)
#creating the dictionary with the column indices as keys and tokens as values
id_map = dict((v, k) for k, v in vect.vocabulary_.items())
#fitting the model looking for 10 topics
ldamodel=gensim.models.ldamodel.LdaModel(corpus, num_topics=10, id2word =id_map, passes=50, random_state=0)
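
The id_map step inverts the vectorizer's vocabulary: vocabulary_ maps each token to its column index, while the id2word argument expects the opposite direction. A toy illustration with an invented three-word vocabulary:

```python
# Hypothetical vocabulary in the shape of vect.vocabulary_: token -> column index.
toy_vocabulary = {"game": 0, "team": 1, "season": 2}

# Inverted mapping in the shape expected for id2word: column index -> token.
toy_id_map = {index: token for token, index in toy_vocabulary.items()}
print(toy_id_map)
```

The inversion is lossless because every token has its own unique column index.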

In [13]:
#printing the 10 topics with their 30 highest-weighted tokens
topics = ldamodel.print_topics(num_topics=10, num_words=30)
topics

[(0,
  '0.046*"thanks" + 0.027*"advance" + 0.020*"group" + 0.019*"heard" + 0.017*"types" + 0.014*"answer" + 0.014*"question" + 0.013*"looking" + 0.013*"copy" + 0.011*"considering" + 0.010*"address" + 0.010*"interested" + 0.009*"does" + 0.009*"opinions" + 0.009*"card" + 0.008*"mail" + 0.008*"know" + 0.008*"theory" + 0.008*"wondering" + 0.008*"email" + 0.007*"faster" + 0.007*"controller" + 0.007*"program" + 0.007*"just" + 0.007*"body" + 0.007*"work" + 0.007*"says" + 0.007*"wouldn" + 0.007*"buying" + 0.007*"record"'),
 (1,
  '0.033*"game" + 0.026*"team" + 0.024*"year" + 0.018*"games" + 0.017*"season" + 0.017*"players" + 0.017*"play" + 0.015*"hockey" + 0.014*"league" + 0.011*"rangers" + 0.011*"baseball" + 0.011*"teams" + 0.010*"toronto" + 0.010*"think" + 0.010*"division" + 0.009*"leafs" + 0.009*"fans" + 0.009*"played" + 0.009*"smith" + 0.009*"player" + 0.008*"detroit" + 0.008*"great" + 0.008*"chicago" + 0.008*"playoffs" + 0.008*"runs" + 0.008*"boston" + 0.007*"goal" + 0.007*"home" + 0.007*

## 4. Spam Detector

In [14]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import roc_auc_score
from sklearn.feature_extraction.text import CountVectorizer

In [15]:
#loading the dataset with labeled spam and nonspam data
spam_data = pd.read_csv('spam2.csv',sep="\t", quotechar = "|")
spam_data.head()

| | text | target |
|---|---|---|
| 0 | Go until jurong point, crazy.. Available only ... | ham |
| 1 | Ok lar... Joking wif u oni... | ham |
| 2 | Free entry in 2 a wkly comp to win FA Cup fina... | spam |
| 3 | U dun say so early hor... U c already then say... | ham |
| 4 | Nah I don't think he goes to usf, he lives aro... | ham |


In [16]:
#convert labels to numbers 0,1
spam_data['target'] = np.where(spam_data['target']=='spam',1,0)
spam_percent = spam_data["target"].value_counts()[1]/len(spam_data["target"])*100
print("%Spam", spam_percent)
spam_data.head(10)

%Spam 13.18858783420061


| | text | target |
|---|---|---|
| 0 | Go until jurong point, crazy.. Available only ... | 0 |
| 1 | Ok lar... Joking wif u oni... | 0 |
| 2 | Free entry in 2 a wkly comp to win FA Cup fina... | 1 |
| 3 | U dun say so early hor... U c already then say... | 0 |
| 4 | Nah I don't think he goes to usf, he lives aro... | 0 |
| 5 | FreeMsg Hey there darling it's been 3 week's n... | 1 |
| 6 | Even my brother is not like to speak with me. ... | 0 |
| 7 | As per your request 'Melle Melle (Oru Minnamin... | 0 |
| 8 | WINNER!! As a valued network customer you have... | 1 |
| 9 | Had your mobile 11 months or more? U R entitle... | 1 |


In [17]:
#creating test and train sets
X_train, X_test, y_train, y_test = train_test_split(spam_data['text'], 
                                                    spam_data['target'], 
                                                    random_state=0)

In [18]:
#using a count-vectorizer to preprocess the data
vect = CountVectorizer(min_df=3).fit(X_train)
X_train_vectorized = vect.transform(X_train)
X_test_vectorized = vect.transform(X_test)
#fitting a multinomial Naive Bayes classifier model with alpha=0.1
MNNB = MultinomialNB(alpha = 0.1)
MNNB.fit(X_train_vectorized, y_train)
predictions = MNNB.predict(X_test_vectorized)
#calculating the area under the curve (AUC) 
auc = roc_auc_score(y_test, predictions)
auc

0.9663809628462937
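
Note that the AUC above is computed from the hard 0/1 predictions. ROC AUC is a ranking metric, so it is commonly computed from scores instead, for example the positive-class probability (for MultinomialNB, predict_proba(...)[:, 1]). The difference can be illustrated with a small pure-Python sketch. pairwise_auc is a hypothetical helper, not scikit-learn's implementation, and the labels and scores are invented.

```python
def pairwise_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


y_true = [0, 0, 1, 1]
probabilities = [0.1, 0.3, 0.4, 0.45]   # invented scores; positives rank highest
hard_labels = [1 if p >= 0.5 else 0 for p in probabilities]

auc_from_scores = pairwise_auc(y_true, probabilities)
auc_from_labels = pairwise_auc(y_true, hard_labels)
print(auc_from_scores, auc_from_labels)
```

In this toy case the scores rank every positive above every negative, yet thresholding at 0.5 turns all predictions into 0 and the label-based AUC drops to chance level. Whether the two variants differ noticeably on the spam data would have to be checked by running both.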

##### Checking whether additional features improve the prediction

In [10]:
#adding additional features
spam_data["length"] = spam_data['text'].apply(len)
spam_data["digits"] = spam_data['text'].str.count(r'\d') 
spam_data["non-word"] = spam_data['text'].str.count(r'\W') 
spam = spam_data[spam_data["target"] ==1]
nonspam = spam_data[spam_data["target"] !=1]

#mean length of the text - spam versus non-spam
spam_av = spam["length"].mean()
nonspam_av = nonspam["length"].mean()
print("nonspam average length:", nonspam_av, "spam average length:", spam_av)

#mean number of digits in text - spam versus non-spam
spam_av_d = spam["digits"].mean()
nonspam_av_d = nonspam["digits"].mean()
print("nonspam average digits:", nonspam_av_d, "spam average digits:", spam_av_d )

#mean number of non-alphanumeric characters - spam versus non-spam
spam_av_nonw = spam["non-word"].mean()
nonspam_av_nonw = nonspam["non-word"].mean()
print("nonspam average non-alphanumeric characters:", nonspam_av_nonw, "spam average non-alphanumeric characters:", spam_av_nonw )

nonspam average length: 71.1486151302191 spam average length: 139.19727891156464
nonspam average digits: 0.32348077718065316 spam average digits: 15.851700680272108
nonspam average non-alphanumeric characters: 17.365440264572136 spam average non-alphanumeric characters: 29.058503401360543


In [11]:
#function for appending extra feature columns to the sparse document-term matrix
def add_feature(X, feature_to_add):
    from scipy.sparse import csr_matrix, hstack
    return hstack([X, csr_matrix(feature_to_add).T], 'csr')

In [12]:
#computing the additional features for the train and test sets
X_train_n= X_train.to_frame()
X_train_chars = X_train_n['text'].apply(lambda x: len(x))
X_train_dig = X_train_n['text'].str.count(r'\d') 
X_train_wob = X_train_n['text'].str.count(r'\W') 
X_test_n =  X_test.to_frame()
X_test_chars = X_test_n['text'].apply(lambda x: len(x))
X_test_dig = X_test_n['text'].str.count(r'\d') 
X_test_wob= X_test_n['text'].str.count(r'\W') 
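
For a single message, the three hand-crafted features (character count, digit count, count of non-word characters) can be reproduced with the standard library. This is a minimal sketch: text_features is a hypothetical helper and the sample message is invented.

```python
import re


def text_features(text):
    """Return the three extra features for one message."""
    return {
        "length": len(text),                        # number of characters
        "digits": len(re.findall(r'\d', text)),     # characters matching \d
        "non_word": len(re.findall(r'\W', text)),   # characters matching \W (spaces, punctuation, ...)
    }


features = text_features("WIN 1000 now!!")
print(features)
```

Note that \W also counts whitespace, so longer messages tend to have a higher non-word count simply because they contain more spaces.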

In [13]:
Xneu_train = add_feature(X_train_vectorized, [X_train_chars, X_train_dig, X_train_wob] )
Xneu_test = add_feature(X_test_vectorized, [X_test_chars, X_test_dig, X_test_wob])
Xtrain_final = Xneu_train.toarray()
Xtest_final = Xneu_test.toarray()

In [14]:
MNNB = MultinomialNB(alpha = 0.1)
MNNB.fit(Xtrain_final,y_train)
predictions = MNNB.predict(Xtest_final)
auc = roc_auc_score(y_test, predictions)
auc

0.9777277185613876

## 5. Date-Parser

The parser finds dates written in formats such as:

* 04/20/2009; 04/20/09; 4/20/09; 4/3/09...
* Mar-20-2009; Mar 20, 2009; March 20, 2009;  Mar. 20, 2009; Mar 20 2009...
* 20 Mar 2009; 20 March 2009; 20 Mar. 2009; 20 March, 2009...
* Mar 20th, 2009; Mar 21st, 2009; Mar 22nd, 2009...
* Feb 2009; Sep 2009; Oct 2010...
* 6/2008; 12/2009...
* 5 B.C.; 50 BC; 500 A.D.; 500 AD; A.D. 50; BC 500...
* 2009; 1983...

The function returns a dataframe with the following columns:
* day
* month
* year
* BC_AD
* text_before_date - only filled if just the year is provided, to avoid confusion with other numbers
* text_after_date - only filled if just the year is provided, to avoid confusion with other numbers


Missing values in any column are marked as NaN.
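
The parser works as a cascade: patterns are tried from most to least specific, and a string is handled by the first pattern that matches. The idea can be sketched for a single string with the re module. This simplified version covers only the numeric, BC/AD and year-only formats, and parse_first_date is a hypothetical helper rather than the notebook's function.

```python
import re

# Ordered from most to least specific; the first pattern that matches wins.
PATTERNS = [
    re.compile(r'(?P<month>\d?\d)[/-](?P<day>\d?\d)[/-](?P<year>\d{2,4})'),
    re.compile(r'(?P<BC_AD>[ACBD.]{2,4}) (?P<year>\d{1,4})'),
    re.compile(r'(?P<year>\d{1,4}) (?P<BC_AD>[ACBD.]{2,4})'),
    re.compile(r'(?P<year>\d{4})'),
]


def parse_first_date(text):
    """Return the named groups of the first matching pattern, or None."""
    for pattern in PATTERNS:
        match = pattern.search(text)
        if match:
            return match.groupdict()
    return None


print(parse_first_date('6/18/85 the primary Care Doctor found....'))
print(parse_first_date('B.C. 215 he wondered if....'))
print(parse_first_date('23 AD they met in a village'))
print(parse_first_date('1985 it happened suddenly....'))
```

Ordering matters: if the year-only pattern came first, it would claim strings such as '07/8/1871' before the more specific numeric pattern had a chance to extract day and month.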


In [16]:
def date_parser(list_of_strings):
    import pandas as pd
    s = pd.Series(list_of_strings)
    df= s.to_frame()
 
    dfe0 = df[0].str.extract(r'(?P<month>Jan|Feb|Mar|March|Apr|April|May|Jun|June|Jul|July|Aug|Sep|Oct|Nov|Dec|December|Decemeber)\D{,10}(?P<day>\d?\d)\D\D?\D?\D?(?P<year>\d{2,4})')
    dfe0_neu = dfe0.dropna(axis=0, how='all')
    a = dfe0_neu.index.tolist()
    dfa = df.drop(df.index[(a)])
    
    dfe1 = dfa[0].str.extract(r'(?P<day>\d?\d)\D(?P<month>Jan|Feb|Mar|Apr|May|Jun|Jul|July|Aug|Sep|Oct|Nov|Dec|December|Decemeber)\D{,10}(?P<year>\d{2,4})')
    dfe1_neu = dfe1.dropna(axis=0, how='all')
    b= dfe1_neu.index.tolist()
    dfb = dfa.drop(df.index[(b)])
    
    dfe2= dfb[0].str.extract(r'(?P<month>Jan|Feb|Mar|Apr|May|Jun|Jul|July|Aug|Sep|Oct|Nov|Dec)\D{,10}(?P<day>)(?P<year>\d{2,4})')
    dfe2_neu = dfe2.dropna(axis=0, how='all')
    c= dfe2_neu.index.tolist()
    dfc = dfb.drop(df.index[(c)])
    
    dfe3 = dfc[0].str.extract(r'(?P<month>\d?\d)(/|-)(?P<day>\d?\d)(/|-)(?P<year>\d{2,4})')
    dfe3_neu = dfe3.dropna(axis=0, how='all')
    d= dfe3_neu.index.tolist()
    dfd = dfc.drop(df.index[(d)])
    dfe3_neu.drop([1,3], axis=1, inplace=True)
    
    dfe4 = dfd[0].str.extract(r'(?P<month>\d?\d)(/|-)(?P<day>)(?P<year>\d{2,4})')
    dfe4_neu = dfe4.dropna(axis=0, how='all')
    dfe4_neu.drop([1], axis=1, inplace=True)
    e = dfe4_neu.index.tolist()
    dfe = dfd.drop(df.index[(e)])
    
    #dates with BC AD before the date
    dfBCAD = dfe[0].str.extract(r'(?P<month>)(?P<day>)(?P<BC_AD>[ACBD.]{2,4}) (?P<year>\d{1,4})')
    dfBCAD_neu = dfBCAD.dropna(axis=0, how='all')
    f = dfBCAD_neu.index.tolist()
    dff = dfe.drop(df.index[(f)])
    
    #dates with BC AD after the date
    dfBCAD2 = dff[0].str.extract(r'(?P<month>)(?P<day>)(?P<year>\d{1,4}) (?P<BC_AD>[ACBD.]{2,4})')
    dfBCAD2_neu = dfBCAD2.dropna(axis=0, how='all')
    g = dfBCAD2_neu.index.tolist()
    dfg = dff.drop(df.index[(g)])
    
    #only the year provided
    df_dates_only = dfg[0].str.extract(r'(?P<text_before_date>\D{0,20})(?P<year>\d{4})(?P<text_after_date>\D{0,20})')
    df_dates_only_neu = df_dates_only.dropna(axis=0, how='all')
   
    frames = [dfe0_neu, dfe1_neu, dfe2_neu, dfe3_neu, dfe4_neu, dfBCAD_neu, dfBCAD2_neu, df_dates_only_neu]
    final = pd.concat(frames, sort=True)
    #final['year'] = final['year'].astype('int64')
 
    nums = [1,2,3,4,5,6,7,8,9,10,11,12]
    month = ['Jan', 'Feb', 'Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec']
   
    for i,y in zip(nums, month):
        final["month"].replace(y, i, inplace = True)##
    
    import numpy as np
    for i in final.columns:
        final[i].replace('', np.nan, inplace = True)

    #reorder columns
    cols = list(final)
    for i in ['day', 'month', 'year', 'text_after_date']:
        cols.remove(i)
    cols.insert(0,'day')
    cols.insert(1,'month')
    cols.insert(2,'year')
    cols.insert(5,'text_after_date')
    final= final.loc[:, cols] 
    return final

In [17]:
#testing the parser
test_list = ['6/18/85 the primary Care Doctor found....','she plans to move in 07/8/1871 to In-Home Services', 'B.C. 215 he wondered if....', '1985 it happened suddenly....', 'In Mar 20th 2010 we met in a cafeteria', '23 AD they met in a village']
date_parser(test_list)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  errors=errors)


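| | day | month | year | BC_AD | text_before_date | text_after_date |
|---|---|---|---|---|---|---|
| 4 | 20.0 | 3.0 | 2010 | | | |
| 0 | 18.0 | 6.0 | 85 | | | |
| 1 | 8.0 | 7.0 | 1871 | | | |
| 2 | | | 215 | B.C. | | |
| 5 | | | 23 | AD | | |
| 3 | | | 1985 | | | it happened suddenl |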
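
The cut-off text in the last row ("suddenl") follows from the year-only pattern, which captures at most 20 non-digit characters on each side of the year. A minimal check with the re module, using the same pattern as in date_parser (the second sample string is invented):

```python
import re

# Same pattern as the year-only step in date_parser.
YEAR_ONLY = re.compile(
    r'(?P<text_before_date>\D{0,20})(?P<year>\d{4})(?P<text_after_date>\D{0,20})'
)

year_match = YEAR_ONLY.search('1985 it happened suddenly....').groupdict()
room_match = YEAR_ONLY.search('room 1204 on the second floor').groupdict()
print(year_match)
print(room_match)
```

The second string shows why the surrounding text is kept: any four-digit number matches, so the context is what lets a reader decide whether the hit is really a year.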
