# Tumor Mutation Classification

In this project, we will analyze the dataset which contains tumor gene mutations and their risk category, which corresponds to their risk of malignancy in the human condition. The dataset was hand-labeled and released by the team of clinical pathologists at Memorial Sloan Kettering in 2018. Our objective of this project is to fit the dataset into our machine learning models to predict the risk category while accounting for highly unbalanced classes. Several methods for text feature generation will be explored and the resulting features will be reduced using principle component analysis (PCA). We will then use the synthetic minority over-sampling technique (SMOTE) to resample the dataset to make the numbers of categories more even. The last step is to compare the machine learning methods.

In [3]:
import pandas as pd
import boto3
import io

import re
import spacy

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

from sklearn.model_selection import GridSearchCV

nlp = spacy.load('en')
#nlp = spacy.blank('en')
#nlp.add_pipe(nlp.create_pipe('sentencizer'))

# Preprocessing

In this section we will load in all our data and format them to the appropriate data types. The text also needs to be cleaned of all non-legitimate words, such as figure references and parentheses. This will be accomplished using regular expressions.

In [4]:
# Set up packages for loading in data
client = boto3.client('s3') #low-level functional API

resource = boto3.resource('s3') #high-level object-oriented API

In [5]:
# Load in training data labels
obj = client.get_object(Bucket='thinkful-capstone', Key='training_variants')
stream = io.BytesIO(obj['Body'].read())
training_variants = pd.read_csv(stream)

In [6]:
print(training_variants.head())
print(training_variants.shape)

   ID    Gene             Variation  Class
0   0  FAM58A  Truncating Mutations      1
1   1     CBL                 W802*      2
2   2     CBL                 Q249E      2
3   3     CBL                 N454D      3
4   4     CBL                 L399V      4
(3321, 4)


In [9]:
# Load in training data text articles
obj = client.get_object(Bucket="thinkful-capstone",Key="training_text")
raw_training_text = obj["Body"].read()
training_text = raw_training_text.decode('utf-8')
print(training_text[:10000])

ID||Text
0||Cyclin-dependent kinases (CDKs) regulate a variety of fundamental cellular processes. CDK10 stands out as one of the last orphan CDKs for which no activating cyclin has been identified and no kinase activity revealed. Previous work has shown that CDK10 silencing increases ETS2 (v-ets erythroblastosis virus E26 oncogene homolog 2)-driven activation of the MAPK pathway, which confers tamoxifen resistance to breast cancer cells. The precise mechanisms by which CDK10 modulates ETS2 activity, and more generally the functions of CDK10, remain elusive. Here we demonstrate that CDK10 is a cyclin-dependent kinase by identifying cyclin M as an activating cyclin. Cyclin M, an orphan cyclin, is the product of FAM58A, whose mutations cause STAR syndrome, a human developmental anomaly whose features include toe syndactyly, telecanthus, and anogenital and renal malformations. We show that STAR syndrome-associated cyclin M mutants are unable to interact with CDK10. Cyclin M silencing pheno

In [10]:
# Eliminate references and abbreviations within parentheses
print(len(training_text))
training_text = re.sub(' \(Fig \d+.+?\)', '', training_text)
training_text = re.sub(' \(Fig\. \d+.+?\)', '', training_text)
training_text = re.sub(' \(\d.*?\)', '', training_text)
training_text = re.sub(' \([A-Z]\)', '', training_text)
print(len(training_text))
print(training_text[:4000])

211296707
205598350
ID||Text
0||Cyclin-dependent kinases (CDKs) regulate a variety of fundamental cellular processes. CDK10 stands out as one of the last orphan CDKs for which no activating cyclin has been identified and no kinase activity revealed. Previous work has shown that CDK10 silencing increases ETS2 (v-ets erythroblastosis virus E26 oncogene homolog 2)-driven activation of the MAPK pathway, which confers tamoxifen resistance to breast cancer cells. The precise mechanisms by which CDK10 modulates ETS2 activity, and more generally the functions of CDK10, remain elusive. Here we demonstrate that CDK10 is a cyclin-dependent kinase by identifying cyclin M as an activating cyclin. Cyclin M, an orphan cyclin, is the product of FAM58A, whose mutations cause STAR syndrome, a human developmental anomaly whose features include toe syndactyly, telecanthus, and anogenital and renal malformations. We show that STAR syndrome-associated cyclin M mutants are unable to interact with CDK10. Cycl

In [11]:
# Split text file into list of documents
training_list = training_text.split('||')
training_list = training_list[2:]
print(len(training_list))

3321


In [12]:
# Load training text into dataframe
texts_df = pd.DataFrame(training_list, columns = ['text'])

# Merge text dataframe with labels dataframe
train = pd.concat([training_variants, texts_df], axis=1)
print(train.head())

   ID    Gene             Variation  Class  \
0   0  FAM58A  Truncating Mutations      1   
1   1     CBL                 W802*      2   
2   2     CBL                 Q249E      2   
3   3     CBL                 N454D      3   
4   4     CBL                 L399V      4   

                                                text  
0  Cyclin-dependent kinases (CDKs) regulate a var...  
1   Abstract Background  Non-small cell lung canc...  
2   Abstract Background  Non-small cell lung canc...  
3  Recent evidence has demonstrated that acquired...  
4  Oncogenic mutations in the monomeric Casitas B...  


Now that the data is loaded into a dataframe, let's do some preliminary data exploration.

In [None]:
print(train['Class'].value_counts())

3321
7    953
4    686
1    568
2    452
6    275
5    242
3     89
9     37
8     19
Name: Class, dtype: int64


We have 3321 total datapoints to work with, and it looks like we are dealing with significant class imbalance. Class 7 has 953 datapoints, while Class 8 has only 19. We will have to address this class imbalance with our experiment design. Additionally, the labels have been anonymized, which means we cannot draw any insight about what these classes might signify. 

# Experiment Design

The prevalence of class imbalance has serious implications for our analysis. First and foremost, we must establish our scoring metric. The purpose here is to use the relevant clinical texts to predict the mutation category for each gene/mutation pair. While we want the predictions to be as accurate as possible, simple classification accuracy is not a representative way to judge models that are built on class imbalance, as they may achieve high accuracy by simply predicting the most common class every time. <br>

Given that we are working with a multi-label classifier, the most appropriate scoring metric is Cohen's Kappa coefficient (K). This statistic measures inter-rater agreement for categorical items. Contraty to conventional classification accuracy, the K statistic accounts for the possibility of the agreement occurring by chance. <br>

The Kappa statistic varies from 0 to 1, where 0 = agreement equivalent to chance, and 1 = perfect agreement. We will be choosing models with Kappa statistics closer to 1. We will also look at the precision and recall via the F1 score. Though these are not optimized for multi-label classification, they will be interesting to consider. <br>

We will also try oversampling the lesser-represented categories and apply our machine learning models on the oversampled datasets, judging by their Kappa statistic. Oversampling can be achieved by generating duplicate datapoints or by generating new synthetic datapoints via SMOTE, the Synthetic Minority Oversampling Technique. <br>

I will use various methods of feature generation including classic NLP techniques such as bag-of-words, tf-idf, and n-grams. These methods of feature generation will be applied to both the original and oversampled datasets. They will then be subjected to various machine learning models. Decision trees are known to perform well on unbalanced datasets, so this model may prevail on the original data. However, Naive Bayes is known to perform well on natural langauge, so once the dataset is oversampled it is possible that Naive Bayes will perform the best.

# Data Cleaning

The data is relatively clean already, and containts no NaN values. It needs to be tokenized so it can be processed into readable pieces of data. We will use spaCy to tokenize the data and create a new column with a list of the tokens for each row. Furthermore, we will convert all tokens that are not stop words or punctuation to lemmas to reduce the noise from unnecessary words.

# ### The following will not run due to computing restraints (programmatic spacy tokenization/lemma conversion)

train['spacy_tokens'] = train['text'].apply(lambda x: nlp(x))

def lemmatize(x):
    return [token.lemma_ for token in x
                      if not token.is_stop
                      and not token.is_punct
                      and not token.lemma_ == "-PRON-"
                      and not token.lemma_ == " " or "  "
    ]

train['lemmas'] = train['spacy_tokens'].apply(lambda x: lemmatize(x))
print(train.head())

def lemmatize(x):
    return [token.lemma_ for token in x
                      if not token.is_stop
                      and not token.is_punct
                      and not token.lemma_ == "-PRON-"
                      and not token.lemma_ == " " or "  "
    ]

segment_size = 50
num_segments = int(len(train)/segment_size)
print('segment size', segment_size)
print(num_segments, 'segments')

train_token = pd.DataFrame()
    
for current_segment in range(num_segments - 1):
    start = current_segment * segment_size
    if current_segment == num_segments - 1:
        end = len(train)
    else:
        end = start + (segment_size - 1)

    segment_df = train[start:end]
    segment_df['spacy_tokens'] = segment_df['text'].apply(lambda x: nlp(x))
    train_token.append(segment_df)

also suggested: <br>
nlp = spacy.blank('de') <br>
nlp.add_pipe(nlp.create_pipe('sentencizer'))

# ### Manual workaround

Now this manual workaround doesn't work...it wasn't removing the stop words (which i wanted to do to reduce features/processing time for tf-idf vectorization) so i changed the language from spacy.blank to normal spacy.load and now i'm getting memory error here. attempted to add swap space using the following method: http://blog.nateharada.com/adding-swap-ec2/

In [None]:
# Create a function to filter out stopwords/punctuation and convert to lemma
def lemmatize(x):
    return [token.lemma_ for token in x
                      if not token.is_stop
                      and not token.is_punct
                      and not token.lemma_ == "-PRON-"
                      and not token.lemma_ == " " or "  "
    ]
# we have a problem here because it is separating hyphenated words and it shouldn't be doing that
# also we probably want to keep the original PRONs
# so let's move forward not using lemmas. not worth the hassle

# Create an empty DataFrame to later add spacy tokens
train_token = pd.DataFrame()

df_0 = train[0:100]
df_0['spacy_tokens'] = df_0['text'].apply(lambda x: nlp(x))
df_0['lemmas'] = df_0['spacy_tokens'].apply(lambda x: lemmatize(x))
train_token = train_token.append(df_0, ignore_index=True)

df_100 = train[100:200]
df_100['spacy_tokens'] = df_100['text'].apply(lambda x: nlp(x))
df_100['lemmas'] = df_100['spacy_tokens'].apply(lambda x: lemmatize(x))
train_token = train_token.append(df_100, ignore_index=True)

df_200 = train[200:300]
df_200['spacy_tokens'] = df_200['text'].apply(lambda x: nlp(x))
df_200['lemmas'] = df_200['spacy_tokens'].apply(lambda x: lemmatize(x))
train_token = train_token.append(df_200, ignore_index=True)

df_300 = train[300:400]
df_300['spacy_tokens'] = df_300['text'].apply(lambda x: nlp(x))
df_300['lemmas'] = df_300['spacy_tokens'].apply(lambda x: lemmatize(x))
train_token = train_token.append(df_300, ignore_index=True)

df_400 = train[400:500]
df_400['spacy_tokens'] = df_400['text'].apply(lambda x: nlp(x))
df_400['lemmas'] = df_400['spacy_tokens'].apply(lambda x: lemmatize(x))
train_token = train_token.append(df_400, ignore_index=True)

df_500 = train[500:600]
df_500['spacy_tokens'] = df_500['text'].apply(lambda x: nlp(x))
df_500['lemmas'] = df_500['spacy_tokens'].apply(lambda x: lemmatize(x))
train_token = train_token.append(df_500, ignore_index=True)

df_600 = train[600:700]
df_600['spacy_tokens'] = df_600['text'].apply(lambda x: nlp(x))
df_600['lemmas'] = df_600['spacy_tokens'].apply(lambda x: lemmatize(x))
train_token = train_token.append(df_600, ignore_index=True)

df_700 = train[700:800]
df_700['spacy_tokens'] = df_700['text'].apply(lambda x: nlp(x))
df_700['lemmas'] = df_700['spacy_tokens'].apply(lambda x: lemmatize(x))
train_token = train_token.append(df_700, ignore_index=True)

df_800 = train[800:900]
df_800['spacy_tokens'] = df_800['text'].apply(lambda x: nlp(x))
df_800['lemmas'] = df_800['spacy_tokens'].apply(lambda x: lemmatize(x))
train_token = train_token.append(df_800, ignore_index=True)

df_900 = train[900:1000]
df_900['spacy_tokens'] = df_900['text'].apply(lambda x: nlp(x))
df_900['lemmas'] = df_900['spacy_tokens'].apply(lambda x: lemmatize(x))
train_token = train_token.append(df_900, ignore_index=True)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
A value is tryin

In [None]:
df_1000 = train[1000:1100]
df_1000['spacy_tokens'] = df_1000['text'].apply(lambda x: nlp(x))
df_1000['lemmas'] = df_1000['spacy_tokens'].apply(lambda x: lemmatize(x))
train_token = train_token.append(df_1000, ignore_index=True)

df_1100 = train[1100:1200]
df_1100['spacy_tokens'] = df_1100['text'].apply(lambda x: nlp(x))
df_1100['lemmas'] = df_1100['spacy_tokens'].apply(lambda x: lemmatize(x))
train_token = train_token.append(df_1100, ignore_index=True)

df_1200 = train[1200:1300]
df_1200['spacy_tokens'] = df_1200['text'].apply(lambda x: nlp(x))
df_1200['lemmas'] = df_1200['spacy_tokens'].apply(lambda x: lemmatize(x))
train_token = train_token.append(df_1200, ignore_index=True)

df_1300 = train[1300:1400]
df_1300['spacy_tokens'] = df_1300['text'].apply(lambda x: nlp(x))
df_1300['lemmas'] = df_1300['spacy_tokens'].apply(lambda x: lemmatize(x))
train_token = train_token.append(df_1300, ignore_index=True)

df_1400 = train[1400:1500]
df_1400['spacy_tokens'] = df_1400['text'].apply(lambda x: nlp(x))
df_1400['lemmas'] = df_1400['spacy_tokens'].apply(lambda x: lemmatize(x))
train_token = train_token.append(df_1400, ignore_index=True)

df_1500 = train[1500:1600]
df_1500['spacy_tokens'] = df_1500['text'].apply(lambda x: nlp(x))
df_1500['lemmas'] = df_1500['spacy_tokens'].apply(lambda x: lemmatize(x))
train_token = train_token.append(df_1500, ignore_index=True)

df_1600 = train[1600:1700]
df_1600['spacy_tokens'] = df_1600['text'].apply(lambda x: nlp(x))
df_1600['lemmas'] = df_1600['spacy_tokens'].apply(lambda x: lemmatize(x))
train_token = train_token.append(df_1600, ignore_index=True)

df_1700 = train[1700:1800]
df_1700['spacy_tokens'] = df_1700['text'].apply(lambda x: nlp(x))
df_1700['lemmas'] = df_1700['spacy_tokens'].apply(lambda x: lemmatize(x))
train_token = train_token.append(df_1700, ignore_index=True)

df_1800 = train[1800:1900]
df_1800['spacy_tokens'] = df_1800['text'].apply(lambda x: nlp(x))
df_1800['lemmas'] = df_1800['spacy_tokens'].apply(lambda x: lemmatize(x))
train_token = train_token.append(df_1800, ignore_index=True)

df_1900 = train[1900:2000]
df_1900['spacy_tokens'] = df_1900['text'].apply(lambda x: nlp(x))
df_1900['lemmas'] = df_1900['spacy_tokens'].apply(lambda x: lemmatize(x))
train_token = train_token.append(df_1900, ignore_index=True)

In [None]:
df_2000 = train[2000:2100]
df_2000['spacy_tokens'] = df_2000['text'].apply(lambda x: nlp(x))
df_2000['lemmas'] = df_2000['spacy_tokens'].apply(lambda x: lemmatize(x))
train_token = train_token.append(df_2000, ignore_index=True)

df_2100 = train[2100:2200]
df_2100['spacy_tokens'] = df_2100['text'].apply(lambda x: nlp(x))
df_2100['lemmas'] = df_2100['spacy_tokens'].apply(lambda x: lemmatize(x))
train_token = train_token.append(df_2100, ignore_index=True)

df_2200 = train[2200:2300]
df_2200['spacy_tokens'] = df_2200['text'].apply(lambda x: nlp(x))
df_2200['lemmas'] = df_2200['spacy_tokens'].apply(lambda x: lemmatize(x))
train_token = train_token.append(df_2200, ignore_index=True)

df_2300 = train[2300:2400]
df_2300['spacy_tokens'] = df_2300['text'].apply(lambda x: nlp(x))
df_2300['lemmas'] = df_2300['spacy_tokens'].apply(lambda x: lemmatize(x))
train_token = train_token.append(df_2300, ignore_index=True)

df_2400 = train[2400:2500]
df_2400['spacy_tokens'] = df_2400['text'].apply(lambda x: nlp(x))
df_2400['lemmas'] = df_2400['spacy_tokens'].apply(lambda x: lemmatize(x))
train_token = train_token.append(df_2400, ignore_index=True)

df_2500 = train[2500:2600]
df_2500['spacy_tokens'] = df_2500['text'].apply(lambda x: nlp(x))
df_2500['lemmas'] = df_2500['spacy_tokens'].apply(lambda x: lemmatize(x))
train_token = train_token.append(df_2500, ignore_index=True)

df_2600 = train[2600:2700]
df_2600['spacy_tokens'] = df_2600['text'].apply(lambda x: nlp(x))
df_2600['lemmas'] = df_2600['spacy_tokens'].apply(lambda x: lemmatize(x))
train_token = train_token.append(df_2600, ignore_index=True)

df_2700 = train[2700:2800]
df_2700['spacy_tokens'] = df_2700['text'].apply(lambda x: nlp(x))
df_2700['lemmas'] = df_2700['spacy_tokens'].apply(lambda x: lemmatize(x))
train_token = train_token.append(df_2700, ignore_index=True)

df_2800 = train[2800:2900]
df_2800['spacy_tokens'] = df_2800['text'].apply(lambda x: nlp(x))
df_2800['lemmas'] = df_2800['spacy_tokens'].apply(lambda x: lemmatize(x))
train_token = train_token.append(df_2800, ignore_index=True)

df_2900 = train[2900:3000]
df_2900['spacy_tokens'] = df_2900['text'].apply(lambda x: nlp(x))
df_2900['lemmas'] = df_2900['spacy_tokens'].apply(lambda x: lemmatize(x))
train_token = train_token.append(df_2900, ignore_index=True)

In [None]:
df_3000 = train[3000:3100]
df_3000['spacy_tokens'] = df_3000['text'].apply(lambda x: nlp(x))
df_3000['lemmas'] = df_3000['spacy_tokens'].apply(lambda x: lemmatize(x))
train_token = train_token.append(df_3000, ignore_index=True)

df_3100 = train[3100:3200]
df_3100['spacy_tokens'] = df_3100['text'].apply(lambda x: nlp(x))
df_3100['lemmas'] = df_3100['spacy_tokens'].apply(lambda x: lemmatize(x))
train_token = train_token.append(df_3100, ignore_index=True)

df_3200 = train[3200:3300]
df_3200['spacy_tokens'] = df_3200['text'].apply(lambda x: nlp(x))
df_3200['lemmas'] = df_3200['spacy_tokens'].apply(lambda x: lemmatize(x))
train_token = train_token.append(df_3200, ignore_index=True)

df_3300 = train[3300:]
df_3300['spacy_tokens'] = df_3300['text'].apply(lambda x: nlp(x))
df_3300['lemmas'] = df_3300['spacy_tokens'].apply(lambda x: lemmatize(x))
train_token = train_token.append(df_3300, ignore_index=True)

In [None]:
print(train_token.head())
print(train_token.shape)

# Did not successfully remove stop words
# try with NLTK

# tokenizes much faster and removes stop words but doesn't really lemmatize

In [33]:
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to /home/ubuntu/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /home/ubuntu/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /home/ubuntu/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


True

In [34]:
from nltk import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer


def clean_text_nltk(df):
    '''
    # split into words
    tokens = word_tokenize(text)
    # convert to lower case
    tokens = [w.lower() for w in tokens]
    # remove punctuation from each word
    ??? table = str.maketrans('', '', string.punctuation)
    ??? stripped = [w.translate(table) for w in tokens]
    # remove remaining tokens that are not alphabetic
    words = [word for word in stripped if word.isalpha()]
    # filter out stop words
    stop_words = set(stopwords.words('english'))
    words = [w for w in words if not w in stop_words]

    stemmed = [wordnet.lemmatize(word) for word in words]

    return ' '.join(stemmed)
    '''
    
    # split into tokens
    nltk_tokens = df['text'].apply(lambda x: word_tokenize(x))
    
    # convert to lowercase
    lower_tokens = nltk_tokens.apply(lambda x: [word.lower() for word in x])
    
    # filter out stop words
    stop_words = set(stopwords.words('english'))
    no_stop_tokens = lower_tokens.apply(lambda x: [word for word in x if not word in stop_words])
    
    # lemmatize
    lemmatizer = WordNetLemmatizer()
    df['lemmas'] = no_stop_tokens.apply(lambda x: [lemmatizer.lemmatize(word) for word in x])
    
    return df
    
train_nltk = clean_text_nltk(train)
print(train_nltk.head())

   ID    Gene             Variation  Class  \
0   0  FAM58A  Truncating Mutations      1   
1   1     CBL                 W802*      2   
2   2     CBL                 Q249E      2   
3   3     CBL                 N454D      3   
4   4     CBL                 L399V      4   

                                                text  \
0  Cyclin-dependent kinases (CDKs) regulate a var...   
1   Abstract Background  Non-small cell lung canc...   
2   Abstract Background  Non-small cell lung canc...   
3  Recent evidence has demonstrated that acquired...   
4  Oncogenic mutations in the monomeric Casitas B...   

                                         nltk_tokens  \
0  [Cyclin-dependent, kinases, (, CDKs, ), regula...   
1  [Abstract, Background, Non-small, cell, lung, ...   
2  [Abstract, Background, Non-small, cell, lung, ...   
3  [Recent, evidence, has, demonstrated, that, ac...   
4  [Oncogenic, mutations, in, the, monomeric, Cas...   

                                        lower_tok

# NLTK: Convert into a list of strings to feed to tf-idf vectorizer

In [53]:
X = train_nltk['lemmas']
Y = train_nltk['Class']

X_lemma_strings = [
    ' '.join([str(word) for word in text])
    for text in X.values.tolist()
]

In [57]:
from collections import Counter
import itertools

# flatten into list of strings insted of nested lists
merged = set(itertools.chain.from_iterable(X_lemma_strings))
# took over an hour 
print(len(X_lemma_strings))
print(len(merged))

common_unigrams = [Counter(merged).most_common(100000)]
print(common_unigrams[:10])
#tfidf_unigram = tfidf_vectorizer(common_unigrams,Y)
#print(tfidf_unigram.head())

MemoryError: 

# NLTK: Oversample now?

In [58]:
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE 

sm = SMOTE(random_state=42)
X_res, Y_res = sm.fit_sample(X, Y)
print('Resampled dataset shape {}'.format(Counter(y_res)))

ValueError: setting an array element with a sequence.

In [62]:
from imblearn.over_sampling import SMOTE
import numpy as np

X = train_nltk.drop('Class', axis=1).values
Y = train_nltk['Class'].values

print(type(X))

# oversample now?
X_resample, Y_resample = SMOTE().fit_sample(X, Y)
print('The number of transactions after resampling : ' + str(len(X_resample)))
print(X_resample.value_count())

X_lemma_strings = [
    ' '.join([str(word) for word in text])
    for text in X.values.tolist()
]

print(len(X_lemma_strings))

<class 'numpy.ndarray'>


ValueError: setting an array element with a sequence.

# It doesn't like setting array elements with a sequence but i'm using array?

In [61]:
X, y = make_classification(n_classes=2, class_sep=2,
weights=[0.1, 0.9], n_informative=3, n_redundant=1, flip_y=0,
n_features=20, n_clusters_per_class=1, n_samples=1000, random_state=10)
print('Original dataset shape {}'.format(Counter(y)))

sm = SMOTE(random_state=42)
X_res, y_res = sm.fit_sample(X, y)
print('Resampled dataset shape {}'.format(Counter(y_res)))

print(type(X))

Original dataset shape Counter({1: 900, 0: 100})
Resampled dataset shape Counter({0: 900, 1: 900})
<class 'numpy.ndarray'>


In [None]:
# Establishing our test datasets
# using lemmas to make processing time shorter

X = train_token['lemmas']
Y = train_token['Class']

# oversample now?
X_resample, Y_resample = SMOTE().fit_sample(X, Y)
print 'The number of transactions after resampling : ' + str(len(X_resample))
print(X_resample.value_count())

X_lemma_strings = [
    ' '.join([str(word) for word in text])
    for text in X.values.tolist()
]

print(len(X_lemma_strings))

In [None]:
print(X_lemma_strings[:2])

In [63]:
from collections import Counter
import itertools

# flatten into list of lists insted of nested lists
merged = list(itertools.chain.from_iterable(X_lemma_strings))
# took over an hour 
print(len(X_lemma_strings))
print(len(merged))

common_unigrams = [Counter(X_strings).most_common(10000)]
print(common_unigrams[:10])
#tfidf_unigram = tfidf_vectorizer(common_unigrams,Y)
#print(tfidf_unigram.head())

MemoryError: 

# ### Oversample now?

If we oversample now, we reduce processing time from having to tokenize/lemmatize duplicates, we can just duplicate now? But then it will add more to the tf-idf processing time? Do I only oversample one category? Will SMOTE even work with non-numerical data? <br>

maybe i have to oversample after vectorization? <br>

Using SMOTE technique for oversampling

approach: vectorize, drop features...then oversample?

In [None]:
import imbalanced_learn as imblearn
from imblearn.over_sampling import SMOTE

X = train_token['lemmas']
Y = train_token['Class']

X_resampled, y_resampled = SMOTE().fit_sample(X, y)

# Feature Generation

https://docs.microsoft.com/en-us/machine-learning-server/python-reference/microsoftml/featurize-text 

from microsoftml import rx_logistic_regression, featurize_text, rx_predict
from microsoftml.entrypoints._stopwordsremover_predefined import predefined

makes it incredibly easy but won't install

what about tSNE dimensionality reduction before tf-idf vectorization? they aren't technically features yet... (use PCA first) <br>
https://lvdmaaten.github.io/tsne/

# Creating n-grams

Beyond using singular words or lemmas as features for classification, we can use groupings of words that appear together, as they may convey more meaning than each word isolated by itself. We will create features for bigrams and trigrams, as any groupings larger than 3 words will likely be too specific and may create unnecessary noise.

In [None]:
def ngrams(input_text, n):
  input_text = input_text.split(' ')
  output = []
  for i in range(len(input_text)-n+1):
    output.append(input_text[i:i+n])
  return output

bigrams = []

for text in X_lemma_strings:
    text_bigrams = ngrams(text, 2)
    for bigram in text_bigrams:
        bigrams.append(bigram)
    
print(bigrams[:1])
# needs punctuation and stopwords removed
# currently returns as nested, can we flatten into one list of lists?

In [None]:
print(bigrams[:10])
print(len(bigrams))

In [None]:
from collections import Counter

common_bigrams = [Counter(bigrams).most_common(10000)]
print(common_bigrams[:10])
#tfidf_unigram = tfidf_vectorizer(common_unigrams,Y)
#print(tfidf_unigram.head())

In [None]:
#trigrams = []

#for text in X_strings:
#    text_trigrams = ngrams(text, 3)
#    trigrams.append(text_trigrams)
    # original syntax that returns triple nested lists
    
#print(trigrams[:1])

# TF-IDF Vectorization

In [None]:
def tfidf_vectorizer(X,Y):
    vectorizer = TfidfVectorizer(max_df=0.5,
                                 min_df=2,
                                 stop_words='english', 
                                 lowercase=True,
                                 norm=u'l2',
                                 smooth_idf=True,
                                )

    sparse_tfidf_matrix=vectorizer.fit_transform(X_strings)
    print(f'Number of features: {sparse_tfidf_matrix.get_shape()[1]}')
    
    # Densify matrix so we can convert it to a conventional dataframe to extract X/Y
    dense_tfidf_matrix = sparse_tfidf_matrix.todense()
    df_tfidf = pd.DataFrame(dense_tfidf_matrix)
    df_tfidf['lemma_tokens'] = X
    df_tfidf['class'] = Y
    
    return df_tfidf

tfidf_unigram = tfidf_vectorizer(X_strings,Y)

In [None]:
tfidf_bigram = tfidf_vectorizer(bigrams, Y)

In [None]:
print(tfidf_df.head())

# Machine Learning Methods

Here, we will attempt to classify the training data using several different machine learning classifiers.

# Naive Bayes

In [None]:
gnb = GaussianNB()
gnb.fit(tX_train, tY_train)
test_pred = gnb.predict(tX_test)
print(f'Testing Accuracy: {accuracy_score(tY_test, test_pred)}')
print(f'Cross Val Score: {cross_val_score(gnb, tX_train, tY_train, cv=3).mean()}')
print(pd.crosstab(tY_test, test_pred))

In [None]:
tfidf = tfidf_vectorizer(
                            stop_words='english', 
                            lowercase=True,
                            norm=u'l2',
                            smooth_idf=True
                        )
nb = GaussianNB()

pipe = Pipeline(steps=[('tfidf', tfidf), ('nb', nb)])

tfidf_maxdf = [0.25, 0.5, 0.75]
tfidf_mindf = [2, 5, 20]

param_grid = [
    {
        'tfidf__max_df': tfidf_maxdf,
        'tfidf__min_df': tfidf_mindf,
    }
]

In [None]:
grid = GridSearchCV(pipe, cv=5, n_jobs=1, param_grid=param_grid)
grid.fit(X_train, Y_train)
#print(f'best params:\n {grid.best_params_}')

In [None]:
from sklearn import linear_model, decomposition, datasets

pipe = Pipeline([
    ('feat_gen', tfidf_vectorizer()),
    ('reduce_dim', decomposition.PCA())
    ('classify', GaussianNB())
])

tfidf_maxdf = [0.25, 0.5, 0.75]
tfidf_mindf = [2, 5, 20]

param_grid_1 = [
    {
        'feat_gen': [tfidf_vectorizer(                            
                                        stop_words='english', 
                                        lowercase=True,
                                        norm=u'l2',
                                        smooth_idf=True
                                    )],
        'feat_gen__max_df': tfidf_maxdf,
        'feat_gen__min_df': tfidf_mindf,
        'classify__C': C_OPTIONS
    },
    {
        'feat_gen': [SelectKBest(chi2)],
        'feat_gen__k': N_FEATURES_OPTIONS,
        'classify__C': C_OPTIONS
    },
]

param_grid_2 = dict(feat_gen=[tfidf_vectorizer(), count_vectorizer()],
                    reduce_dim = [PCA(), tSNE()]
                    clf = [GaussianNB()])


grid = GridSearchCV(pipe, cv=3, n_jobs=1, param_grid=param_grid)
grid.fit(X, Y)

##### needs to use a real class not a custom function
##### input must be X list of strings (each string corresponds to a single document) and Y of same size