# Institutional User Identification with Python and Scikit-Learn
This notebook performs the following tasks:
- Clean the Twitter description data. The cleaning and preparation steps are implemented in the `clean` function in the second block.
- Test the following models on our dataset:
 - Random Forest
 - Logistic Regression
 - Support Vector Machine (linear kernel)
 - Naive Bayes (Multinomial and Gaussian variants)
- Save each model's predictions to a CSV file
- Evaluate the models automatically and save the metrics to a CSV file
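
To make the cleaning step concrete, here is a minimal sketch of its first stage on a made-up description. The helper name `split_markers` and the sample text are assumptions for illustration; the notebook's own `clean` function additionally removes stop words with NLTK.

```python
import re

def split_markers(text):
    """Keep letters, '#', and '@', then separate each marker from its word."""
    # Replace every character that is not a letter, '#', or '@' with a space.
    letters_only = re.sub("[^a-zA-Z#@]", " ", text)
    words = []
    for word in letters_only.split():
        for marker in "#@":
            if marker in word:
                # "#space" becomes "# space", "@NASA" becomes "@ NASA".
                word = (marker + " ").join(word.split(marker))
        words.append(word)
    return " ".join(words)

# Hypothetical description, invented for this example.
sample = "Official news from @NASA! Follow #space 24/7"
print(split_markers(sample))
```

Digits and punctuation disappear, while hashtags and mentions survive as separate `#` and `@` tokens that a vectorizer can count.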

*The following two blocks of code can be skipped. Block 1 imports all the necessary packages. Block 2 contains the `clean`, `read_data`, and `eval_model` functions.*

In [61]:
import pandas as pd
import re, csv
import numpy as np

import nltk
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB, GaussianNB
from sklearn.svm import SVC

In [62]:
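def clean(string):
    """Return the description reduced to letters, '#', and '@', minus most stop words."""
    # Replace every character that is not a letter, '#', or '@' with a space.
    letters_only = re.sub("[^a-zA-Z#@]", " ", string)

    words = letters_only.split()

    for i in range(0, len(words)):
        # Separate '#' and '@' markers from the text that follows them.
        if "#" in words[i]:
            s = words[i].split('#')
            words[i] = '# '.join(s)
        if "@" in words[i]:
            s = words[i].split('@')
            words[i] = '@ '.join(s)
        if "http" in words[i]:
            # Splitting and re-joining on the same string leaves the word unchanged.
            s = words[i].split('http')
            words[i] = "http".join(s)

    # Remove English stop words, but keep the first 20 entries of NLTK's list.
    total_stop_words = stopwords.words("english")
    stop_words = set(total_stop_words) - set(total_stop_words[0:20])
    content_words = [w for w in words if w not in stop_words]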

    return " ".join(content_words)

def read_data(file):
    data = pd.read_table(file)
    # Clean each description and cast the label column to integers.
    data['Description'] = data['Description'].apply(clean)
    data['Personal'] = data['Personal'].astype(int)
    return data

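def eval_model(name, classes, predictions):
    """Compute confusion-matrix counts and summary metrics; class 1 is the positive class."""
    TP, TN, FP, FN = 0, 0, 0, 0
    classes = list(classes)
    predictions = list(predictions)

    # Tally each (actual, predicted) pair into exactly one confusion-matrix cell.
    for actual, predicted in zip(classes, predictions):
        if actual == 1 and predicted == 1:
            TP += 1
        elif actual == 0 and predicted == 0:
            TN += 1
        elif actual == 1 and predicted == 0:
            FN += 1
        elif actual == 0 and predicted == 1:
            FP += 1

    # These ratios raise ZeroDivisionError when a denominator is zero,
    # for example when a model never predicts class 1.
    accuracy = (TP + TN) / (TP + TN + FP + FN)
    precision = TP / (TP + FP)
    recall = TP / (TP + FN)
    f_one = 2 * (precision * recall) / (precision + recall)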
    
    scores = {
        'Model Name': name,
        'Accuracy': accuracy,
        'Precision': precision,
        'Recall': recall,
        'F1 Score': f_one,
        'Total TP type':TP,
        'Total TN type':TN,
        'Total FP type':FP,
        'Total FN type':FN
    }
    
    return scores 
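
As a worked example of the formulas in `eval_model`, the sketch below computes the same four metrics from hypothetical confusion-matrix counts. The helper names `safe_ratio` and `confusion_metrics` and the counts themselves are illustrative assumptions, not part of the notebook. Returning 0.0 for an empty denominator is one possible convention, not the only one.

```python
def safe_ratio(numerator, denominator):
    """Divide, returning 0.0 instead of raising when the denominator is zero."""
    return numerator / denominator if denominator else 0.0

def confusion_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    precision = safe_ratio(tp, tp + fp)
    recall = safe_ratio(tp, tp + fn)
    return {
        "Accuracy": safe_ratio(tp + tn, tp + tn + fp + fn),
        "Precision": precision,
        "Recall": recall,
        "F1 Score": safe_ratio(2 * precision * recall, precision + recall),
    }

# Hypothetical counts: 40 true positives, 45 true negatives,
# 5 false positives, 10 false negatives.
example = confusion_metrics(tp=40, tn=45, fp=5, fn=10)
print(example)

# A model that never predicts the positive class no longer crashes.
degenerate = confusion_metrics(tp=0, tn=90, fp=0, fn=10)
print(degenerate)
```

With these counts, accuracy is 85/100, precision is 40/45, recall is 40/50, and F1 equals 2·TP / (2·TP + FP + FN) = 80/95.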

The following block is divided into three parts:

- Lines 1-5: load the data into pandas DataFrames by calling the `read_data` function defined above
- Lines 7-10: vectorize the user descriptions (encode each text description as a numeric vector of word counts)
- Line 12: declare `Y` as the target (response) variable, that is, the `Personal` label of each training description

In [100]:
train_dir = str() 
test_dir = str() 

training_set= read_data(train_dir)
testing_set = read_data(test_dir)

v = CountVectorizer(analyzer="word", tokenizer=None, preprocessor=None, stop_words= None, max_features=500)

train_vectors = v.fit_transform(training_set['Description'])
test_vectors = v.transform(testing_set['Description'])  # reuse the training vocabulary

Y = training_set['Personal']
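
The vectorization step can be sketched on toy data to show why the vocabulary should be learned once, from the training descriptions, and then reused for the test descriptions. The documents below are invented for illustration. Calling `fit_transform` a second time on the test set would build a different vocabulary, so the columns of the two matrices would no longer refer to the same words.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy descriptions, invented for this example.
train_docs = ["official account of the city library", "dad runner coffee lover"]
test_docs = ["coffee lover and library fan"]

vectorizer = CountVectorizer(analyzer="word", max_features=500)
train_vectors = vectorizer.fit_transform(train_docs)  # learns the vocabulary
test_vectors = vectorizer.transform(test_docs)        # reuses that vocabulary

# Both matrices share one column per vocabulary word.
print(train_vectors.shape, test_vectors.shape)

# Words unseen during training ("and", "fan") are ignored at test time.
print(sorted(vectorizer.vocabulary_))
```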

### We will now build, train, and test our five models.

In [101]:
LR = LogisticRegression()
LR.fit(train_vectors, Y)

testing_set['Logistic Regression Predictions'] = LR.predict(test_vectors)

In [102]:
RF = RandomForestClassifier(n_estimators=100)
RF.fit(train_vectors, Y)

testing_set['Random Forest Predictions'] = RF.predict(test_vectors)

In [103]:
GNB = GaussianNB()
GNB.fit(train_vectors.toarray(), Y)

testing_set['GaussianNB Predictions'] = GNB.predict(test_vectors.toarray())

In [104]:
MNB = MultinomialNB()
MNB.fit(train_vectors, Y)

testing_set['MultinomialNB Predictions'] = MNB.predict(test_vectors)

In [105]:
SVM = SVC(kernel = 'linear')
SVM.fit(train_vectors, Y)

testing_set['SVM Predictions'] = SVM.predict(test_vectors)
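
The five cells above follow the same fit-then-predict pattern, so they could also be written as a loop. The sketch below is one way to do that on invented toy data. The `models` dictionary, the sample descriptions, and the fixed `random_state` are assumptions for illustration, and the features are converted to dense arrays because `GaussianNB` does not accept sparse input.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.svm import SVC

# Toy data, invented for illustration (1 = personal, 0 = institutional).
train_docs = [
    "dad runner coffee lover",
    "student dreamer music fan",
    "official account city library",
    "official news university press office",
]
labels = [1, 1, 0, 0]
test_docs = ["coffee fan", "official press account"]

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_docs).toarray()  # dense for GaussianNB
X_test = vectorizer.transform(test_docs).toarray()

models = {
    "Logistic Regression": LogisticRegression(),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "GaussianNB": GaussianNB(),
    "MultinomialNB": MultinomialNB(),
    "SVM": SVC(kernel="linear"),
}

# Fit each model on the training vectors and collect its test predictions
# under the same column names the notebook uses.
predictions = {}
for name, model in models.items():
    model.fit(X_train, labels)
    predictions[name + " Predictions"] = model.predict(X_test)

print(sorted(predictions))
```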

If needed, we can save the results from our models to a CSV file (this is done in the next block of code). The models' predictions are stored in the DataFrame `testing_set` and can be viewed by displaying it.

In [106]:
testing_set.to_csv('data/ModelOutput/' + train_dir[10:-4] + '+' + test_dir[10:-4] + '_results.csv', index=False)
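
The output filename above is built by slicing `train_dir[10:-4]`, which presumably drops a ten-character directory prefix and a four-character extension such as `.csv`. If that assumption holds, a sketch like the following derives the same name from the path itself, so it keeps working when the directory or extension changes. The helper `output_path` and the example paths are hypothetical.

```python
from pathlib import Path

def output_path(train_path, test_path, suffix, out_dir="data/ModelOutput"):
    """Build '<out_dir>/<train stem>+<test stem>_<suffix>.csv'."""
    # Path.stem is the filename without its directory and final extension.
    name = f"{Path(train_path).stem}+{Path(test_path).stem}_{suffix}.csv"
    return (Path(out_dir) / name).as_posix()

# Hypothetical input paths, invented for this example.
results_file = output_path("data/sets/train_a.tsv", "data/sets/test_b.tsv", "results")
print(results_file)
```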

### Let's construct a table of evaluation metrics for our models

In [127]:
ts = testing_set

RF_metric = eval_model('RF', ts['Personal'], ts['Random Forest Predictions'])
LR_metric = eval_model('LR', ts['Personal'], ts['Logistic Regression Predictions'])
GNB_metric = eval_model('GNB', ts['Personal'], ts['GaussianNB Predictions'])
MNB_metric = eval_model('MNB', ts['Personal'], ts['MultinomialNB Predictions'])
SVM_metric = eval_model('SVM', ts['Personal'], ts['SVM Predictions'])

metrics_list = [ RF_metric, LR_metric, GNB_metric, MNB_metric, SVM_metric]
metrics = pd.DataFrame(metrics_list)
metrics.to_csv('data/ModelOutput/' + train_dir[10:-4] + '+' + test_dir[10:-4] + '_metrics.csv', index=False)