# Searching for Unique Text Features and Normalizing Variations

In this notebook I explore the training dataset, looking for relevant features for our model.

First, I try to make some sense of the variants and genes provided in the input files.

Then I explore the terms of the training texts according to the classes in which they appear:
    1- Which terms are unique to each class? How many documents in the training set can be classified with this bag of words alone?
    2- Which terms appear in all 9 classes? I show them in a tag cloud, sized by frequency.
    3- What is the coverage of the terms that appear in at most one class, at most two classes, and at most three classes?



In [None]:
import pandas as pd
import numpy as np
import os
import json
import nltk, re, math, collections
from nltk.corpus import stopwords
from nltk.corpus import wordnet
import operator
from sklearn.preprocessing import LabelEncoder
import lightgbm as lgb
from datetime import datetime
import matplotlib.pyplot as plt
import seaborn as sns
from os import path
from wordcloud import WordCloud


In [None]:
train_v = pd.read_csv('../input/training_variants')
test_v = pd.read_csv('../input/test_variants')
# raw strings keep the regex separator \|\| free of invalid escape sequences
train_t = pd.read_csv('../input/training_text',sep=r'\|\|',skiprows=1,engine='python',names=["ID","Text"])
test_t = pd.read_csv('../input/test_text',sep=r'\|\|',skiprows=1,engine='python',names=["ID","Text"])

train = pd.merge(train_v, train_t, how='left', on='ID').fillna('')
y_labels = train['Class'].values

test = pd.merge(test_v, test_t, how='left', on='ID').fillna('')
test_id = test['ID'].values

# Let's explore Genes and Variations

First of all, sorry for my English and my code, I'm a beginner with Python. ;)

Let's create the function "variationProc". The aim is to get a group of features that provide more information than the raw variations and genes.

Without any knowledge about genes, I want to try to understand what information they give us...

Looking at the variations, we can see that they are written in many different forms, so the information could be very noisy, and we probably get a unique variation value for most rows of the dataset.



In [None]:
print("there are ",len(train["Variation"]),"rows for the training set")
print("there are ",len(set(list(train["Variation"]))), " different values for variations")
print("there are ",len(set(list(train["Gene"]))), " different values for genes")

Using the raw variations as a training feature won't add any information to the signal: we would only get noise, or overfitting if we overtrain the model.
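
To put a number on how close to unique the raw values are, here is a minimal sketch. The helper name `uniqueness_ratio` and the toy list are only illustrative assumptions, they are not part of the original notebook or the real data.

```python
def uniqueness_ratio(values):
    """Fraction of distinct values among all values (1.0 means every row is unique)."""
    values = list(values)
    if not values:
        return 0.0
    return len(set(values)) / len(values)

# Hypothetical toy column, not taken from the real dataset.
toy_variations = ["V468G", "Truncating Mutations", "K745_A750del", "V468G"]
print(uniqueness_ratio(toy_variations))
```

On the real data it could be called as `uniqueness_ratio(train["Variation"])`. A ratio close to 1 means that the column behaves almost like an identifier.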

In [None]:
train["Variation"][:50]

A lot of variations take the shape "letter""number""letter". For example: V468G.


Exploring the values in depth, we can see that there are other kinds, like "truncating", "promoter", "amplification", "wildtype", "deletion", "insertion", etc.

My approach is to create a dataset with the following columns:

    1- Gene: the feature gene. Ex: runx1

    2- Gene2: sometimes a second gene is involved in the operation (as in a fusion). For example, in the variation runx1 evi1 fusion, evi1 would be the second gene.

    3- Operation: the operation on the feature gene (deletion, insertion, fusion, etc.)

    4- Letter1: in V468G, letter1 would be V

    5- Number1: in V468G, number1 would be 468

    6- Letter2: sometimes deletions/insertions contain 2 letter-number pairs. For example, in the variation K745_A750del, letter2 would be A.

    7- Number2: in K745_A750del, number2 would be 750

    8- ObjLetter: the target letter of the form letter-number-letter. In the variation V468G, objletter would be G

The function variationProc tries to build this dataset. Sorry for the code, I'm sure it could be better with the use of regex.

In [None]:
def variationProc(variations, genes):
    vari2=[]
    for i in range(0, len(variations)):
        esfusion=False
        texto=variations[i].lower()
        # normalize: drop spaces and quotes, and use "#" as the only field separator
        texto = texto.replace(" ","")
        texto = texto.replace("_","#")        
        texto = texto.replace("\'","")
        texto = texto.replace("-","#")
        
        if "truncating" in texto:
            texto="trunc"+"#null"+"#null"+"#null"+"#null"+"#null"
        elif "promotermut" in texto:
            texto="promotermut"+"#null"+"#null"+"#null"+"#null"+"#null"
        elif "promoterhyper" in texto:
            texto="promoterhyper"+"#null"+"#null"+"#null"+"#null"+"#null"
        elif "ampli" in texto:
            texto="ampli"+"#null"+"#null"+"#null"+"#null"+"#null"
        elif "overex" in texto:
            texto="overex"+"#null"+"#null"+"#null"+"#null"+"#null"
        elif "dnabinding" in texto:
            texto="dnabinding"+"#null"+"#null"+"#null"+"#null"+"#null"
        elif "wildtype" in texto:
            texto="wildtype"+"#null"+"#null"+"#null"+"#null"+"#null"
        elif "epigeneticsil" in texto:
            texto="epigeneticsil"+"#null"+"#null"+"#null"+"#null"+"#null"
        elif "copynumberloss" in texto:
            texto="copynumberloss"+"#null"+"#null"+"#null"+"#null"+"#null"
        elif "hypermethyl" in texto:
            texto="hypermethyl"+"#null"+"#null"+"#null"+"#null"+"#null"
        elif "singlenucleotidepolymo" in texto:
            texto="singlenucleotidepolymo"+"#null"+"#null"+"#null"+"#null"+"#null"  
        elif "exon" in texto:
            texto=texto+"#null"+"#null"+"#null"+"#null"+"#null"
        elif "fs" in texto:
            texto=texto.replace("fs","")
            if re.match("(\D+)(\d+)(\D+)", texto):
                if texto[1:5].isnumeric():
                    texto="fs"+"#"+texto[0:1]+"#"+texto[1:5]+"#null"+"#null"+"#null"
                elif texto[1:4].isnumeric():                       
                    texto="fs"+"#"+texto[0:1]+"#"+texto[1:4]+"#null"+"#null"+"#null"
                else:                        
                    texto="fs"+"#"+texto[0:1]+"#"+texto[1:3]+"#null"+"#null"+"#null"
            elif re.match("(\D+)(\d+)", texto):
                if " " not in texto:
                    texto="fs"+"#"+texto[0:1]+"#"+texto[1:]+"#null"+"#null"+"#null"
            else:
                texto="fs"+"#"+texto
        elif "deletion/insertion" in texto:
            texto = texto.replace("deletion/insertion","delins")
        elif "delins" in texto:
            texto=texto.replace("delins","")
            if "#" not in texto:
                if re.match("(\D+)(\d+)(\D+)", texto):
                    texto="delins"+"#"+texto[0:1]+"#"+texto[1:4]+"#"+texto[4:4]+"#null"+"#null"+"#null"
                elif re.match("(\D+)(\d+)", texto):
                    texto="delins"+"#"+texto[0:1]+"#"+texto[1:]+"#null"+"#null"+"#null"
                else:
                    texto="delins"+"#null"+"#null"+"#null"+"#null"+"#null"
            else:
                lista=texto.split("#")
                if re.match("(\D+)(\d+)", lista[0]):
                    texto="delins"+"#"+lista[0][0:1]+"#"+lista[0][1:]+"#"
                if re.match("(\D+)(\d+)", lista[1]):
                    texto=texto+lista[1][0:1]+"#"+lista[1][1:4]+"#"+lista[1][4:5]
        elif "fusion" in texto:
            esfusion=True
            texto = texto.replace("fusion","")
            lista=texto.split("#")
            if len(lista)==2:
                if genes[i].lower() in lista[0].lower():
                    texto = lista[1].lower()+"#fusion" +"#null"+"#null"+"#null"+"#null"+"#null"
                else:
                    texto = lista[0].lower()+"#fusion"+"#null"+"#null"+"#null"+"#null"+"#null"
            else:
                texto = genes[i].lower()+"#fusion"+"#null"+"#null"+"#null"+"#null"+"#null"
        elif "deletion" in texto:
            texto = texto.replace("deletion","del")
            texto = texto.replace("3del","del")
        elif "del" in texto:
            texto=texto.replace("del","")
            if "#" not in texto:
                if re.match("(\D+)(\d+)", texto):
                    texto="del"+"#"+texto[0:1]+"#"+texto[1:]
            else:
                lista=texto.split("#")
                texto="del"
                if re.match("(\D+)(\d+)", lista[0]):
                    texto=texto+"#"+lista[0][0:1]+"#"+lista[0][1:]+"#"
                if re.match("(\D+)(\d+)", lista[1]):
                    texto=texto+lista[1][0:1]+"#"+lista[1][1:]+"#"   
                if re.match("(\d+)(\d+)", lista[0]):
                    texto=texto+"#null#"+lista[0] 
                if re.match("(\d+)(\d+)", lista[1]):
                    texto=texto+"#null#"+lista[1]
        elif "insertion" in texto:
            texto = texto.replace("insertion","ins")
        elif "ins" in texto:
            texto=texto.replace("ins","")
            if "#" not in texto:
                if re.match("(\D+)(\d+)(\D+)", texto):
                    if "#" not in texto:
                        texto="ins"+"#"+texto[0:1]+"#"+texto[1:4]+"#"+texto[4:4]
                elif re.match("(\D+)(\d+)", texto):
                    texto="ins"+"#"+texto[0:1]+"#"+texto[1:]
            else:
                lista=texto.split("#")
                texto="ins"
                if re.match("(\D+)(\d+)", lista[0]):
                    texto=texto+"#"+lista[0][0:1]+"#"+lista[0][1:]+"#"
                if re.match("(\D+)(\d+)", lista[1]):
                    texto=texto+lista[1][0:1]+"#"+lista[1][1:4]+"#"+lista[1][4:5]
                if re.match("(\d+)(\d+)", lista[0]):
                    texto=texto+"#null#"+lista[0] 
                if re.match("(\d+)(\d+)", lista[1]):
                    texto=texto+"#null#"+lista[1]
        elif "dup" in texto:
            texto=texto.replace("dup","")
            if " " not in texto:
                if texto[1:].isnumeric():
                    if re.match("(\D+)(\d+)(\D+)", texto):
                        if " " not in texto:
                            texto="dup"+"#"+texto[0:1]+"#"+texto[1:4]+"#"+texto[4:4]
                    elif re.match("(\D+)(\d+)", texto):
                        texto="dup"+"#"+texto[0:1]+"#"+texto[1:]
                else:
                    texto="dup"
            else:
                lista=texto.split("#")
                if re.match("(\D+)(\d+)", lista[0]):
                    texto="dup"+"#"+lista[0][0:1]+"#"+lista[0][1:]+"#"
                if re.match("(\D+)(\d+)", lista[1]):
                    texto=texto+lista[1][0:1]+"#"+lista[1][1:4]+"#"+lista[1][4:5]
        elif "splice" in texto:
            texto=texto.replace("splice","")
            if "#" not in texto:
                if texto[1:].isnumeric():
                    if re.match("(\d+)", texto):
                        texto="splice"+"#null#"+texto+"#null"+"#null"+"#null"
                    elif re.match("(\D+)(\d+)", texto):
                        texto="splice"+"#null#"+texto[1:]+"#null"+"#null"+"#null"
                else:
                    texto="splice"+"#null"+"#null"+"#null"+"#null"+"#null"
            else:
                lista=texto.split("#")
                if re.match("(\D+)(\d+)", lista[0]):
                    texto="splice"+"#null#"+lista[0][1:]
                else:
                    texto="splice"+"#null#"+lista[0]
                    if re.match("(\D+)(\d+)", lista[1]):
                        texto=texto+"#null#"+lista[1][1:]+"#null"
                    else:
                        texto=texto+"#null#"+lista[1]
        elif re.match("(\D)(\d+)(\D+)", texto):
            if " " not in texto:
                texto="sub"+"#"+texto[0:1]+"#"+texto[1:len(texto)-1]+"#null"+"#null"+"#"+texto[len(texto)-1:]
        elif re.match("(\D)(\d+)", texto):
            if " " not in texto:
                texto="sub"+"#"+texto[0:1]+"#"+texto[1:]+"#null"+"#null"+"#null"
        else:
            texto="others"+"#null"+"#null"+"#null"+"#null"+"#null"
        
        if esfusion:
            vari2.append(genes[i].lower()+"#"+texto)
        else:
            vari2.append(genes[i].lower()+"#null#"+texto)
            
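    mat=[]   
    for linea in vari2:
        # str.replace returns a new string, so the result has to be assigned back
        linea = linea.replace(" ","")
        linea = linea.replace("##","#")
        lista=linea.split("#")
        lineaadd=[]
        # positions 4 and 6 hold number1 and number2: keep only the digits and store them as floats
        if len(lista)>4:
            lista[4] = re.sub(r"\D", "", lista[4])
            if lista[4]=="":
                lista[4]=np.nan
            else:
                lista[4]=float(lista[4])
        if len(lista)>6:
            lista[6] = re.sub(r"\D", "", lista[6])
            if lista[6]=="":
                lista[6]=np.nan
            else:
                lista[6]=float(lista[6])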
        for te in range(0,len(lista)):
            if lista[te] == "null" or lista[te] == "":
                lista[te]=None
                if te==4 or te==6:
                    lista[te]=np.nan
        if len(lista)<8:
            for j in range(len(lista),9):
                if j==4 or j==6:
                    lista.append(np.nan)
                else:
                    lista.append(None)
        
        for j in range(0,8):
            lineaadd.append(lista[j])
        mat.append(lineaadd)
        
    print("Done...")
    return(mat)
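
As mentioned above, part of this parsing could be written with regular expressions. The following is only a sketch under my own assumptions: it covers just two shapes (point substitutions like V468G and ranges like K745_A750del), and the names `SUBSTITUTION`, `RANGE_OP` and `parse_variation` are hypothetical, so it is not a replacement for variationProc.

```python
import re

# Point substitution such as "V468G": reference letter, position, target letter.
SUBSTITUTION = re.compile(r"^([a-z])(\d+)([a-z*])$")
# Range operation such as "K745_A750del": two letter-number pairs plus an operation.
RANGE_OP = re.compile(r"^([a-z])(\d+)_([a-z])(\d+)(delins|del|ins|dup)")

def parse_variation(variation):
    """Return a dict with the variation columns; operation is 'others' if nothing matches."""
    text = variation.lower().replace(" ", "")
    row = {"operation": "others", "letter1": None, "number1": None,
           "letter2": None, "number2": None, "objletter": None}
    m = SUBSTITUTION.match(text)
    if m:
        row.update(operation="sub", letter1=m.group(1),
                   number1=int(m.group(2)), objletter=m.group(3))
        return row
    m = RANGE_OP.match(text)
    if m:
        row.update(operation=m.group(5), letter1=m.group(1), number1=int(m.group(2)),
                   letter2=m.group(3), number2=int(m.group(4)))
    return row

print(parse_variation("V468G"))
print(parse_variation("K745_A750del"))
```

Capturing groups avoid the fixed-width slices (like `texto[1:4]`), so positions with any number of digits are handled in the same way.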


In [None]:
print("Processing gene and variation with VariationProc...")
vartra=variationProc(train["Variation"], train["Gene"])

In [None]:
vardf = pd.DataFrame(vartra, columns=["gene1","gene2","operation","letter1","number1","letter2","number2","objletter"])
vardf['Class'] = train["Class"]

We can see that there are a lot of None and NaN values in the gene2, letter2 and number2 columns.

In [None]:
vardf

Let's explore the features with plotting...

In [None]:
plt.figure(figsize=(11,7))
sns.countplot(x="Class", data=train)
plt.ylabel('Frequency', fontsize=13)
plt.xlabel('Classes', fontsize=13)
plt.title("Frequency of Classes", fontsize=18)
plt.show()


We have an unbalanced dataset...

In [None]:
plt.figure(figsize=(11,57))
sns.countplot(y="gene1", data=vardf)
plt.ylabel('Gene1', fontsize=13)
plt.xlabel('Frequency', fontsize=13)
plt.title("Frequency of Feature Gene1", fontsize=18)
plt.show()

egfr, tp53, pten, brca1, brca2, braf and kit are the most common genes...

In [None]:
plt.figure(figsize=(10,50))
sns.stripplot(x="Class", y="gene1", data=vardf, jitter=True);
plt.show()

In [None]:
plt.figure(figsize=(10,30))
sns.countplot(y="gene2", data=vardf)
plt.ylabel('Gene2', fontsize=13)
plt.xlabel('Frequency', fontsize=13)
plt.title("Frequency of Feature Gene2", fontsize=18)
plt.show()

Only a few rows contain a gene2 value. Should we ignore them?

In [None]:
plt.figure(figsize=(10,30))
sns.stripplot(x="Class", y="gene2", data=vardf, jitter=True);
plt.show()

In [None]:
plt.figure(figsize=(11,7))
sns.countplot(y="operation", data=vardf)
plt.ylabel('Operation', fontsize=13)
plt.xlabel('Frequency', fontsize=13)
plt.title("Frequency of Feature Operation", fontsize=18)
plt.show()

The most frequent operation is sub: variations of the form "letter""number""letter".

In [None]:
plt.figure(figsize=(10,10))
sns.stripplot(x="Class", y="operation", data=vardf, jitter=True);
plt.show()

In [None]:
plt.figure(figsize=(11,7))
sns.countplot(x="letter1", data=vardf)
plt.ylabel('Frequency', fontsize=13)
plt.xlabel('Letter1', fontsize=13)
plt.title("Frequency of Feature Letter1", fontsize=18)
plt.show()

In [None]:
plt.figure(figsize=(10,10))
sns.stripplot(x="Class", y="letter1", data=vardf, jitter=True);
plt.show()

In [None]:
plt.figure(figsize=(10,10))
sns.stripplot(x="Class", y="number1", data=vardf, jitter=True);
plt.show()

In [None]:
plt.figure(figsize=(11,7))
sns.countplot(x="letter2", data=vardf)
plt.ylabel('Frequency', fontsize=13)
plt.xlabel('Letter2', fontsize=13)
plt.title("Frequency of Feature Letter2", fontsize=18)
plt.show()

In [None]:
plt.figure(figsize=(10,10))
sns.stripplot(x="Class", y="letter2", data=vardf, jitter=True);
plt.show()

In [None]:
plt.figure(figsize=(10,10))
sns.stripplot(x="Class", y="number2", data=vardf, jitter=True);
plt.show()

In [None]:
plt.figure(figsize=(11,7))
sns.countplot(x="objletter", data=vardf)
plt.ylabel('Frequency', fontsize=13)
plt.xlabel('ObjLetter', fontsize=13)
plt.title("Frequency of Feature ObjLetter", fontsize=18)
plt.show()

In [None]:
plt.figure(figsize=(10,10))
sns.stripplot(x="Class", y="objletter", data=vardf, jitter=True);
plt.show()

# Some Text Mining Features

Let's start by searching for the words that are unique to each class.

First of all, I'm going to group the texts by their respective classes.

In [None]:
# Concatenate all the texts of each class (1 to 9) into one long string per class.
# Every class is read from the merged "train" frame, where missing texts were already replaced by ''.
class_texts = []
for k in range(1, 10):
    class_texts.append(" ".join(train.loc[train["Class"]==k, "Text"])+" ")

c1, c2, c3, c4, c5, c6, c7, c8, c9 = class_texts
 


The tokenize function splits a text into lemmatized tokens and counts them.

In [None]:
def tokenize(_str):
    stops = set(stopwords.words("english"))
    tokens = collections.defaultdict(lambda: 0.)
    wnl = nltk.WordNetLemmatizer()
    for m in re.finditer(r"(\w+)", _str, re.UNICODE):
        m = m.group(1).lower()
        if len(m) < 2: continue
        if m in stops: continue
        if m.isnumeric():continue
        m = wnl.lemmatize(m)
        tokens[m] += 1 
    return tokens

In [None]:
texts_for_training=[]
texts_for_test=[]
num_texts_train=len(train)
num_texts_test=len(test)

print("Tokenizing training texts")
for i in range(0,num_texts_train):
    if((i+1)%1000==0):
        print("Text %d of %d\n"%((i+1), num_texts_train))
    texts_for_training.append(tokenize(train["Text"][i]))
    

print("Tokenizing test texts")
for i in range(0,num_texts_test):
    if((i+1)%1000==0):
        print("Text %d of %d\n"%((i+1), num_texts_test))
    texts_for_test.append(tokenize(test["Text"][i]))

In [None]:
print("Tokenizing cluster 1")
cluster1=tokenize(c1)

print("Tokenizing cluster 2")
cluster2=tokenize(c2)

print("Tokenizing cluster 3")
cluster3=tokenize(c3)

print("Tokenizing cluster 4")
cluster4=tokenize(c4)

print("Tokenizing cluster 5")
cluster5=tokenize(c5)

print("Tokenizing cluster 6")
cluster6=tokenize(c6)

print("Tokenizing cluster 7")
cluster7=tokenize(c7)

print("Tokenizing cluster 8")
cluster8=tokenize(c8)

print("Tokenizing cluster 9")
cluster9=tokenize(c9)

uniqsPerClass is a function that returns the bag of words of one class, restricted to the terms that appear in exactly n classes (if the exact parameter is True) or in at most n classes (if exact is False). The number of classes n is given by the objective parameter, and clase is the tokenized cluster to filter. For example: to find the terms that appear only in class 1 we use uniqsPerClass(cluster1, 1, True); to find the terms of class 5 that appear in at most 3 classes we use uniqsPerClass(cluster5, 3, False); etc.

In [None]:
def uniqsPerClass(clase, objective, exact):

    uniqs = collections.defaultdict(lambda: 0.)

    for t, v in clase.items():
        apears=0
        if t in cluster1:
            apears+=1
        if t in cluster2:
            apears+=1
        if t in cluster3:
            apears+=1
        if t in cluster4:
            apears+=1
        if t in cluster5:
            apears+=1
        if t in cluster6:
            apears+=1
        if t in cluster7:
            apears+=1  
        if t in cluster8:
            apears+=1
        if t in cluster9:
            apears+=1
    
        if exact:            
            if apears==objective:
                uniqs[t]=v
        else:
            if apears<(objective+1):
                uniqs[t]=v
    return uniqs
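
The same selection can be sketched more compactly by first counting in how many classes each term occurs. This is an illustrative variant under my own assumptions (the name `terms_in_at_most` and the toy dictionaries are hypothetical); it takes the clusters as a dict instead of reading the global cluster1 to cluster9.

```python
from collections import Counter

def terms_in_at_most(class_tokens, target, n, exact=False):
    """Terms of class `target` that appear in at most n classes (exactly n if exact)."""
    # Iterating a dict yields its keys, so each class contributes each term once.
    class_count = Counter(term for tokens in class_tokens.values() for term in tokens)
    if exact:
        keep = lambda c: c == n
    else:
        keep = lambda c: c <= n
    return {t: v for t, v in class_tokens[target].items() if keep(class_count[t])}

# Hypothetical token counts for three classes.
toy = {
    1: {"kinase": 3.0, "tumor": 5.0},
    2: {"tumor": 2.0, "splice": 1.0},
    3: {"tumor": 4.0, "kinase": 1.0},
}
print(terms_in_at_most(toy, 2, 1))        # terms found only in class 2
print(terms_in_at_most(toy, 1, 3, True))  # terms of class 1 shared by all three classes
```

Counting once and reusing the counter avoids nine membership checks per term, and it works for any number of classes.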


In [None]:
uniC1=uniqsPerClass(cluster1,1,False)
uniC2=uniqsPerClass(cluster2,1,False)
uniC3=uniqsPerClass(cluster3,1,False)
uniC4=uniqsPerClass(cluster4,1,False)
uniC5=uniqsPerClass(cluster5,1,False)
uniC6=uniqsPerClass(cluster6,1,False)
uniC7=uniqsPerClass(cluster7,1,False)
uniC8=uniqsPerClass(cluster8,1,False)
uniC9=uniqsPerClass(cluster9,1,False)


The termsComps function takes a tokenized text and returns a list with the proportions of its term occurrences that fall in each of the subgroups generated with uniqsPerClass. It gives the degree of membership of a document in each class, taking into account only the terms selected by that function.

In [None]:
def termsComps(file):
    c1,c2,c3,c4,c5,c6,c7,c8,c9=0.,0.,0.,0.,0.,0.,0.,0.,0.
    for t, v in file.items():
        if t in uniC1:
            c1+=v
        if t in uniC2:
            c2+=v
        if t in uniC3:
            c3+=v
        if t in uniC4:
            c4+=v
        if t in uniC5:
            c5+=v
        if t in uniC6:
            c6+=v
        if t in uniC7:
            c7+=v
        if t in uniC8:
            c8+=v
        if t in uniC9:
            c9+=v
    # compute the total once, after the loop, so that empty documents are handled too
    suma=c1+c2+c3+c4+c5+c6+c7+c8+c9
    if suma==0:
        suma=1
            
    return [c1/suma,c2/suma,c3/suma,c4/suma,c5/suma,c6/suma,c7/suma,c8/suma,c9/suma]
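
To make the membership idea concrete, here is a small worked sketch with three classes instead of nine. The function name `membership` and the toy inputs are assumptions for illustration only.

```python
def membership(doc_tokens, selected_terms_per_class):
    """Share of the document's selected-term occurrences that falls in each class."""
    scores = [sum(v for t, v in doc_tokens.items() if t in terms)
              for terms in selected_terms_per_class]
    total = sum(scores) or 1.0  # avoid dividing by zero when nothing matches
    return [s / total for s in scores]

# Hypothetical term selections for three classes and one tokenized document.
selected = [{"kinase"}, {"splice"}, {"amplified"}]
doc = {"kinase": 3.0, "splice": 1.0, "tumor": 10.0}
print(membership(doc, selected))  # [0.75, 0.25, 0.0]
```

The term "tumor" is not selected for any class, so it is ignored: the 4 remaining occurrences are split 3 to 1 between the first two classes.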

In [None]:
uniqsTextMatr=[]
for file in texts_for_training:
    uniqsTextMatr.append(termsComps(file))

In [None]:
uniqText = pd.DataFrame(uniqsTextMatr, columns=['class'+str(c+1) for c in range(9)])
uniqText['RealClass'] = train["Class"]

In [None]:
uniqText

In [None]:
def vacuo(row):
    # True when the row has no selected term for any of the 9 classes
    return all(value==0.0 for value in row[0:9])

def precisionT(subclas, realclas, takeNullConsider):
    # subclas: list of membership rows (9 values each); realclas: real classes (1 to 9)
    correct,total=0.,0.
    for i in range(0, len(realclas)):
        row=subclas[i][0:9]
        if not takeNullConsider and vacuo(row):
            continue
        total+=1
        # note: a null row always predicts class 1, because index(max) returns the first position
        if row.index(max(row))==realclas[i]-1:
            correct+=1
    return correct/total if total else 0.

def precisionCoverNull(subclas, realclas,classtocover):
    # like precisionT, but the null rows are assigned to the class "classtocover"
    correct,total=0.,0.
    for i in range(0, len(realclas)):
        row=subclas[i][0:9]
        total+=1
        if not vacuo(row):
            if row.index(max(row))==realclas[i]-1:
                correct+=1
        elif classtocover==realclas[i]:
            correct+=1
    return correct/total if total else 0.
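
The difference between the two ways of measuring can be seen on a toy example. This is a sketch with three classes and made-up rows; the function name `precision` and its parameter `include_empty` are assumptions of mine.

```python
def precision(rows, labels, include_empty):
    """Share of rows whose highest score matches the label (labels start at 1)."""
    correct = total = 0
    for row, label in zip(rows, labels):
        if not any(row) and not include_empty:
            continue  # skip rows that contain no selected term at all
        total += 1
        # index(max(row)) is 0 for an all-zero row, i.e. class 1 is predicted
        correct += row.index(max(row)) == label - 1
    return correct / total if total else 0.0

# Two informative rows and two all-zero rows (made-up values).
rows = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
labels = [1, 2, 3, 1]
print(precision(rows, labels, include_empty=False))  # 1.0
print(precision(rows, labels, include_empty=True))   # 0.75
```

The last row is counted as correct only because an all-zero row falls back to class 1, so the precision that includes null rows depends on how frequent class 1 is.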
    


In [None]:
noinfo=0
for i in range(0,len(uniqText)):
    row=[]
    row.append(uniqText["class1"][i])
    row.append(uniqText["class2"][i])
    row.append(uniqText["class3"][i])
    row.append(uniqText["class4"][i])
    row.append(uniqText["class5"][i])
    row.append(uniqText["class6"][i])
    row.append(uniqText["class7"][i])
    row.append(uniqText["class8"][i])
    row.append(uniqText["class9"][i])
    if vacuo(row):
        noinfo+=1
    
        
print("There are ",len(uniqText)-noinfo, " texts of ",len(uniqText)," in training set that can be classified in their correct class only with the \"unique words per class\" information")

uniqTextList=uniqText.values.tolist()  

forcompare=[]
for i in range(0,len(uniqTextList)):
    forcompare.append(uniqTextList[i][0:9])
    

Without taking the null rows [0,0,0,...,0,0] into consideration, the precision is 100%.

In [None]:
print(precisionT(forcompare,uniqText["RealClass"],False))

Taking the null rows [0,0,0,...,0,0] into consideration, the precision is approx. 60%.

In [None]:
print(precisionT(forcompare,uniqText["RealClass"],True))

In [None]:
uniqText.describe()

Unique words for class 1

In [None]:
def dictotext(dic):
    text=""
    for t,v in dic.items():
        for i in range(0,int(v)):
            text=text+t+" "
    return text            

In [None]:
print("there are ",len(uniC1),"unique words in class1")
text = dictotext(uniC1)
wordcloud = WordCloud(width=800, height=400, max_font_size=80,collocations = False,).generate(text)
plt.figure(figsize=(20,5))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()

In [None]:
print("there are ",len(uniC2),"unique words in class2")
text = dictotext(uniC2)
wordcloud = WordCloud(width=800, height=400, max_font_size=80,collocations = False).generate(text)
plt.figure(figsize=(20,5))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()

In [None]:

print("there are ",len(uniC3),"unique words in class3")
text = dictotext(uniC3)
wordcloud = WordCloud(width=800, height=400, max_font_size=80,collocations = False).generate(text)
plt.figure(figsize=(20,5))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()

In [None]:
print("there are ",len(uniC4),"unique words in class4")
text = dictotext(uniC4)
wordcloud = WordCloud(width=800, height=400, max_font_size=80,collocations = False).generate(text)
plt.figure(figsize=(20,5))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()

In [None]:
print("there are ",len(uniC5),"unique words in class5")
text = dictotext(uniC5)
wordcloud = WordCloud(width=800, height=400, max_font_size=80,collocations = False).generate(text)
plt.figure(figsize=(20,5))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()

In [None]:
print("there are ",len(uniC6),"unique words in class6")
text = dictotext(uniC6)
wordcloud = WordCloud(width=800, height=400, max_font_size=80,collocations = False).generate(text)
plt.figure(figsize=(20,5))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()

In [None]:
print("there are ",len(uniC7),"unique words in class7")
text = dictotext(uniC7)
wordcloud = WordCloud(width=800, height=400, max_font_size=80,collocations = False).generate(text)
plt.figure(figsize=(20,5))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()

In [None]:
print("there are ",len(uniC8),"unique words in class8")
text = dictotext(uniC8)
wordcloud = WordCloud(width=800, height=400, max_font_size=80,collocations = False).generate(text)
plt.figure(figsize=(20,5))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()

In [None]:
print("there are ",len(uniC9),"unique words in class9")
text = dictotext(uniC9)
wordcloud = WordCloud(width=800, height=400, max_font_size=80,collocations = False).generate(text)
plt.figure(figsize=(20,5))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()

Unique words could help us improve the signal in some cases.

Now let's show the words shared by all the classes, sized by frequency. Maybe they add noise to the signal.

In [None]:
norel=uniqsPerClass(cluster8,9,True)

In [None]:
print("there are ",len(norel),"words that appear in all classes")

text = dictotext(norel)
wordcloud = WordCloud(width=800, height=400, max_font_size=80,collocations = False).generate(text)
plt.figure(figsize=(20,5))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()

What happens if we include all the words that appear in at most 2 classes?

In [None]:
uniC1=uniqsPerClass(cluster1,2,False)
uniC2=uniqsPerClass(cluster2,2,False)
uniC3=uniqsPerClass(cluster3,2,False)
uniC4=uniqsPerClass(cluster4,2,False)
uniC5=uniqsPerClass(cluster5,2,False)
uniC6=uniqsPerClass(cluster6,2,False)
uniC7=uniqsPerClass(cluster7,2,False)
uniC8=uniqsPerClass(cluster8,2,False)
uniC9=uniqsPerClass(cluster9,2,False)

uniqsTextMatr=[]
for file in texts_for_training:
    uniqsTextMatr.append(termsComps(file))
    
uniqText = pd.DataFrame(uniqsTextMatr, columns=['class'+str(c+1) for c in range(9)])
uniqText['RealClass'] = train["Class"]

In [None]:
uniqText


In [None]:
noinfo=0
for i in range(0,len(uniqText)):
    row=[]
    row.append(uniqText["class1"][i])
    row.append(uniqText["class2"][i])
    row.append(uniqText["class3"][i])
    row.append(uniqText["class4"][i])
    row.append(uniqText["class5"][i])
    row.append(uniqText["class6"][i])
    row.append(uniqText["class7"][i])
    row.append(uniqText["class8"][i])
    row.append(uniqText["class9"][i])
    if vacuo(row):
        noinfo+=1
    
        
print("There are ",len(uniqText)-noinfo, " texts of ",len(uniqText)," in training set that contain at least one term selected by the \"uniqsPerClass\" function")

uniqTextList=uniqText.values.tolist()  

forcompare=[]
for i in range(0,len(uniqTextList)):
    forcompare.append(uniqTextList[i][0:9])
    

Without taking the null rows [0,0,0,...,0,0] into consideration, the precision is approx. 82%.

In [None]:
print(precisionT(forcompare,uniqText["RealClass"],False))

Taking the null rows [0,0,0,...,0,0] into consideration, the precision is 67%.

In [None]:
print(precisionT(forcompare,uniqText["RealClass"],True))

In [None]:
uniC1=uniqsPerClass(cluster1,3,False)
uniC2=uniqsPerClass(cluster2,3,False)
uniC3=uniqsPerClass(cluster3,3,False)
uniC4=uniqsPerClass(cluster4,3,False)
uniC5=uniqsPerClass(cluster5,3,False)
uniC6=uniqsPerClass(cluster6,3,False)
uniC7=uniqsPerClass(cluster7,3,False)
uniC8=uniqsPerClass(cluster8,3,False)
uniC9=uniqsPerClass(cluster9,3,False)

uniqsTextMatr=[]
for file in texts_for_training:
    uniqsTextMatr.append(termsComps(file))
    
uniqText = pd.DataFrame(uniqsTextMatr, columns=['class'+str(c+1) for c in range(9)])
uniqText['RealClass'] = train["Class"]
uniqText

In [None]:
noinfo=0
for i in range(0,len(uniqText)):
    row=[]
    row.append(uniqText["class1"][i])
    row.append(uniqText["class2"][i])
    row.append(uniqText["class3"][i])
    row.append(uniqText["class4"][i])
    row.append(uniqText["class5"][i])
    row.append(uniqText["class6"][i])
    row.append(uniqText["class7"][i])
    row.append(uniqText["class8"][i])
    row.append(uniqText["class9"][i])
    if vacuo(row):
        noinfo+=1
    
        
print("There are ",len(uniqText)-noinfo, " texts of ",len(uniqText)," in training set that contain at least one term selected by the \"uniqsPerClass\" function")

uniqTextList=uniqText.values.tolist()  

forcompare=[]
for i in range(0,len(uniqTextList)):
    forcompare.append(uniqTextList[i][0:9])

Without taking the null rows [0,0,0,...,0,0] into consideration, the precision is approx. 75%.

In [None]:
print(precisionT(forcompare,uniqText["RealClass"],False))

Taking the null rows [0,0,0,...,0,0] into consideration, the precision is 68%.

In [None]:
print(precisionT(forcompare,uniqText["RealClass"],True))

So, taking into account only the terms that appear in at most 3 classes, the signal could be improved.
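
The three experiments (terms in at most 1, 2 and 3 classes) could be run in a single loop. Below is a minimal sketch on made-up documents: the function `evaluate` and the toy data are assumptions, and the numbers it prints say nothing about the real dataset. As in the cells above, the terms are selected and evaluated on the same documents, so the result is optimistic.

```python
from collections import Counter

def evaluate(docs, labels, max_classes):
    """Coverage and precision when keeping terms seen in at most max_classes classes."""
    classes = sorted(set(labels))
    # Set of terms seen in each class.
    vocab = {c: set() for c in classes}
    for doc, label in zip(docs, labels):
        vocab[label].update(doc)
    class_count = Counter(t for terms in vocab.values() for t in terms)
    covered = correct = 0
    for doc, label in zip(docs, labels):
        scores = [sum(v for t, v in doc.items()
                      if t in vocab[c] and class_count[t] <= max_classes)
                  for c in classes]
        if max(scores) == 0:
            continue  # no selected term in this document
        covered += 1
        # ties are resolved in favour of the first class, as with index(max(...)) above
        correct += classes[scores.index(max(scores))] == label
    coverage = covered / len(docs)
    precision = correct / covered if covered else 0.0
    return coverage, precision

# Made-up tokenized documents with their classes.
docs = [{"kinase": 2, "tumor": 5}, {"splice": 1, "tumor": 3},
        {"tumor": 4}, {"kinase": 1, "tumor": 1}]
labels = [1, 2, 3, 2]
for n in (1, 2, 3):
    print(n, evaluate(docs, labels, n))
```

In this toy case the coverage grows with n while the precision on the covered documents drops, which is the same trade-off seen in the cells above.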

# To be continued...