<h1 style="text-align:center">Module 6 Assessment</h1>

Welcome to your Mod 6 Assessment. This assessment tests your understanding of the concepts covered in class and in the curriculum, and your ability to solve problems programmatically. Topics in this assessment include graph theory, natural language processing, and neural networks. 

You will have up to 90 minutes to complete this assessment. 

## Natural Language Processing Assessment

In this exercise we will attempt to classify text messages as "SPAM" or "HAM" using TF-IDF Vectorization. Complete the functions below and answer the question(s) at the end. 

Import necessary libraries (we've imported some for you)

In [None]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, confusion_matrix

Read in data

In [None]:
df_messages = pd.read_csv('data/spam.csv', usecols=[0,1])

Convert string labels to 1 or 0 

In [None]:
le = LabelEncoder()
df_messages['target'] = le.fit_transform(df_messages['v1'])
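
As a quick check on what the encoded `target` column means, the sketch below fits a separate `LabelEncoder` on a few toy labels. The labels `"ham"` and `"spam"` are assumed to match the values in `v1`, and `toy_labels`, `toy_encoder`, and `label_mapping` are hypothetical names used only for this illustration.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy labels standing in for the 'v1' column
toy_labels = ["ham", "spam", "ham"]

toy_encoder = LabelEncoder()
encoded = toy_encoder.fit_transform(toy_labels)

# classes_ is sorted, so each class's position in it is its integer code
label_mapping = {label: code for code, label in enumerate(toy_encoder.classes_)}

print(label_mapping)
print(encoded.tolist())
```

Because the classes are sorted alphabetically, `"ham"` is encoded as 0 and `"spam"` as 1 under this assumption.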

### Tokenize

1. Create a function that takes in a string and returns a list of tokens or words

In [None]:
def tokenize(string):
    '''
    Tokenizes a string
    
    Parameters
    ----------
    string: str object
        String object to tokenize
    Returns
    --------
    tokens : list
        A list containing each word in string as an element 

    '''
    pass

### Remove Stopwords

2. Create a function to remove stopwords and punctuation from a list of tokens

In [None]:
def remove_stopwords(tokens): 
    '''
    Removes stopwords and punctuation from a list of tokens (words)
    
    Parameters
    ----------
    tokens: list object
        List of tokens that need stopwords removed
    Returns
    --------
    stopwords_removed : list object
        A list containing tokens with stopwords removed

    '''
    pass

Apply tokenization and stop word removal to our dataset 

In [None]:
df_messages['tokens'] = df_messages['v2'].apply(tokenize)
df_messages['stopwords_removed'] = df_messages['tokens'].apply(remove_stopwords)

### Most Common Words

3. Create a function that outputs the frequency of the n most common words

In [None]:
def frequency_distribution(tokens, n):
    '''
    Get n most common words in a Series of tokens
    
    Parameters
    ----------
    tokens: pandas.Series object
        Pandas series in which each element is a list of tokens 
    n : int object
        Integer defining the number of most common words to return
    Returns
    --------
    most_common : list object
        A list of (word, count) tuples for the n most common words

    '''
    pass

In [None]:
frequency_distribution(df_messages['stopwords_removed'], 10)
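
For reference, the shape of the expected return value can be seen on a toy Series. This is only a sketch of one possible approach using `collections.Counter`, not the required implementation; `toy_tokens` and `top_two` are hypothetical names.

```python
from collections import Counter
from itertools import chain

import pandas as pd

# Hypothetical toy data: each element is a list of tokens, like 'stopwords_removed'
toy_tokens = pd.Series([["free", "prize", "call"], ["call", "now"], ["free", "call"]])

# Flatten the lists into one stream of words, then count occurrences
counts = Counter(chain.from_iterable(toy_tokens))

# most_common(n) returns a list of (word, count) tuples, highest count first
top_two = counts.most_common(2)
print(top_two)
```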

### TF-IDF

4. Generate tf-idf vectorization for our data (split data into train and test here)

In [None]:
def tfidf(X, y): 
    '''
    Generate train and test TF-IDF vectorization for our data set
    
    Parameters
    ----------
    X: pandas.Series object
        Pandas series of text documents to classify 
    y : pandas.Series object
        Pandas series containing label for each document
    Returns
    --------
    tf_idf_train :  sparse matrix, [n_train_samples, n_features]
        Vector representation of train data
    tf_idf_test :  sparse matrix, [n_test_samples, n_features]
        Vector representation of test data
    y_train : array-like object
        labels for training data
    y_test : array-like object
        labels for testing data
    vectorizer : vectorizer object
        fitted TF-IDF vectorizer object

    '''
    
    pass


Run tfidf()

In [None]:
tf_idf_train, tf_idf_test, y_train, y_test, vecotorizer = tfidf(df_messages['v2'], df_messages['target'])
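
One detail worth keeping in mind for question 4 is that the vectorizer is usually fit on the training documents only and then reused to transform the test documents, so both matrices share the same feature columns. The sketch below shows that pattern on toy data, assuming scikit-learn's `TfidfVectorizer` and `train_test_split`; the documents, labels, split size, and variable names are all hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Hypothetical toy corpus and labels (1 = spam, 0 = ham)
toy_docs = [
    "win a free prize now",
    "are we still meeting today",
    "free entry win cash",
    "call me when you get home",
]
toy_labels = [1, 0, 1, 0]

docs_train, docs_test, labels_train, labels_test = train_test_split(
    toy_docs, toy_labels, test_size=0.25, random_state=0
)

toy_vectorizer = TfidfVectorizer()
# Learn the vocabulary and IDF weights from the training documents only
train_matrix = toy_vectorizer.fit_transform(docs_train)
# Reuse the fitted vocabulary so test columns line up with train columns
test_matrix = toy_vectorizer.transform(docs_test)

print(train_matrix.shape, test_matrix.shape)
```

Words that appear only in the test documents are ignored by `transform`, which is why the two matrices have the same number of columns.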

### Classification

Initialize classifiers

In [None]:
nb_classifier = MultinomialNB()
rf_classifier = RandomForestClassifier(n_estimators=100)

5. Create a function that takes in a classifier, trains it on our tf-idf vectors, and generates train and test predictions

In [None]:
def classify_text(classifier, tf_idf_train, tf_idf_test, y_train):
    '''
    Train a classifier to identify whether a message is spam or ham
    
    Parameters
    ----------
    classifier: sklearn classifier
       initialized sklearn classifier (MultinomialNB, RandomForestClassifier, etc.)
    tf_idf_train : sparse matrix, [n_train_samples, n_features]
        TF-IDF vectorization of train data
    tf_idf_test : sparse matrix, [n_test_samples, n_features]
        TF-IDF vectorization of test data
    y_train : pandas.Series object
        Pandas series containing label for each document in the train set
    Returns
    --------
    train_preds :  list object
        Predictions for train data
    test_preds :  list object
        Predictions for test data
    '''
    pass

Generate predictions for Naive Bayes Classifier

In [None]:
nb_train_preds, nb_test_preds = classify_text(nb_classifier, tf_idf_train, tf_idf_test, y_train)

In [None]:
print(confusion_matrix(y_test, nb_test_preds))
print(accuracy_score(y_test, nb_test_preds))


Generate predictions for Random Forest Classifier

In [None]:
rf_train_preds, rf_test_preds = classify_text(rf_classifier, tf_idf_train, tf_idf_test, y_train)

In [None]:
print(confusion_matrix(y_test, rf_test_preds))
print(accuracy_score(y_test, rf_test_preds))

This function returns the word with the highest TF-IDF value in a given document

In [None]:
def get_max_tf_idf(tf_idf, doc):
    '''
    Get word with highest TF-IDF value in a document
    
    Parameters
    ----------
    tf_idf : sparse matrix 
        TF-IDF vectorization of text data
    doc : int object
        Index of document in vectorization to get max tf-idf for
    Returns
    --------
    max_tf_idf : str object
        Word with highest TF-IDF value in a document
    '''
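    # Relies on the global `vecotorizer` returned by tfidf() above
    x = tf_idf[doc].toarray()
    # get_feature_names() was removed in scikit-learn 1.2;
    # get_feature_names_out() requires scikit-learn >= 1.0
    max_tf_idf = vecotorizer.get_feature_names_out()[np.argmax(x[0])]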
    return max_tf_idf

### Explain

6. The word "schools" has the highest TF-IDF value in the second document of our test data. What does that tell us about the word "schools"? 


In [None]:
get_max_tf_idf(tf_idf_test, 1)

## Network Analysis Assessment

For these next questions, you'll be using a graph dataset of Facebook users and `networkx`. In the next cell, we're going to read in the dataset.

In [31]:
import networkx as nx
G = nx.read_edgelist('./data/0.edges')

1. Create a function `find_centrality` that returns a dictionary with the user with the highest betweenness centrality and the user with the highest degree centrality. It should return a dictionary that looks like:


{'bc' : |user|, 'dc' : |user|}

How does each of these people wield influence on the network? Imagine a message had to get to people from different communities. Who would be the best user to deliver the message to ensure that people from opposite communities receive the message?

In [32]:
def find_centrality(graph):
    """
    Calculates the most central nodes on a graph
    
    Parameters
    ----------
    graph: networkx Graph object
        Graph object to be analyzed
    Returns
    --------
    centrality_dict : dict
        A dictionary with the highest ranked user based on betweenness centrality ('bc') and degree centrality ('dc') 
    """
    pass
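
To see how the two measures can point to different users, here is a minimal sketch on a hypothetical toy graph (not the Facebook data): two tight groups joined only through a single bridge user `x`. It uses `nx.betweenness_centrality` and `nx.degree_centrality`; the node names and `top_nodes` are assumptions made for illustration.

```python
from itertools import combinations

import networkx as nx

toy_graph = nx.Graph()
# Community A: four users who are all connected to each other
toy_graph.add_edges_from(combinations(["a1", "a2", "a3", "a4"], 2))
# Community B: four users, slightly less densely connected
toy_graph.add_edges_from(
    [("b1", "b2"), ("b1", "b3"), ("b2", "b3"), ("b2", "b4"), ("b3", "b4")]
)
# 'x' is the only link between the two communities
toy_graph.add_edges_from([("x", "a1"), ("x", "b1")])

betweenness = nx.betweenness_centrality(toy_graph)
degree = nx.degree_centrality(toy_graph)

# Pick the node with the largest score under each measure
top_nodes = {
    "bc": max(betweenness, key=betweenness.get),
    "dc": max(degree, key=degree.get),
}
print(top_nodes)
```

In this toy graph the bridge user has only two connections but sits on every shortest path between the communities, while the user with the most connections is inside a single community.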

2. A marketing group is looking to target different communities with advertisements based on their assumed mutual interests. Use the `k_clique_communities` function from `networkx` to calculate the number of communities formed from cliques of k users in a function `find_k_communities`. Calculate how many communities there are if the minimum size of a clique is 5.


In [34]:
def find_k_communities(graph, k):
    """
    Counts the k-clique communities in a graph
    
    Parameters
    ----------
    graph: networkx Graph object
        Graph object to be analyzed
    k : int
        Size of the smallest clique (number of users) used to build communities
    
    Returns
    -------
    num_communities: int
        The number of communities present in the graph
    """
    pass
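
As a reminder of how k-clique communities behave, the sketch below runs `k_clique_communities` from `networkx.algorithms.community` on a hypothetical toy graph: two groups of four fully connected users that share a single user. With k = 3, cliques only merge into one community when they overlap in k - 1 = 2 users, so the two groups stay separate. The graph and variable names are assumptions for illustration only.

```python
from itertools import combinations

import networkx as nx
from networkx.algorithms.community import k_clique_communities

toy_network = nx.Graph()
# Two fully connected groups of four users that share only user 4
toy_network.add_edges_from(combinations([1, 2, 3, 4], 2))
toy_network.add_edges_from(combinations([4, 5, 6, 7], 2))

# k_clique_communities yields one frozenset of nodes per community
toy_communities = [set(c) for c in k_clique_communities(toy_network, 3)]
toy_num_communities = len(toy_communities)

print(toy_num_communities)
print(sorted(sorted(c) for c in toy_communities))
```

Note that user 4 belongs to both communities, since k-clique communities are allowed to overlap.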