# Classifying Facebook Resellers
## Machine Learning Classification

In this project, I build a machine learning model that classifies Facebook posts into three classes: `Reseller`, `No Seller`, and `Fake Seller`. Because the unprocessed data had very few useful features to begin with, my main focus was on
    1. Extracting meta features and interactions between them
    2. Quantifying the unique characteristics of each description text through latent semantic analysis.
    
You can run this code either in a Jupyter notebook or as the .py version.

----
## Data Exploration & Preprocessing

*This script was written with Python 3.6.0 on OSX*

You will need to install the following Python modules if you don't already have them (`re`, `os`, and `datetime` are also used, but they ship with Python):
  * pandas
  * numpy
  * nltk
  * sklearn (installed as `scikit-learn`)
  * xgboost (http://xgboost.readthedocs.io/en/latest/build.html)
  
You can run individual cells in a Jupyter notebook (https://jupyter.readthedocs.io/en/latest/install.html#new-to-python-and-jupyter)

In [1]:
import pandas as pd
import os
import numpy as np
import re
from datetime import datetime
from nltk.stem import SnowballStemmer
import xgboost as xgb

from sklearn import metrics
from sklearn.preprocessing import Imputer, LabelEncoder, MinMaxScaler
from sklearn.model_selection import KFold, train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier

# Edit the path to your data if necessary
data_path = 'Data&Data Classification Challenge - Facebook - Training Set.csv'

# load the training data
data = pd.read_csv(data_path, delimiter="\t")

print ('The dataset has {} rows and {} columns'
       .format(data.shape[0], data.shape[1]))
print ('Owner types are {}' .format(set(data['owner_type'])))
print ('Target labels are {}'.format(set(data['INDEX New'])))
print ('Percentage of Fake Seller: {:.2f}%'
       .format(len(data[data['INDEX New'] == 'Fake Seller']) /
               data.shape[0] * 100))
print ('Percentage of Reseller: {:.2f}%'
       .format(len(data[data['INDEX New'] == 'Reseller']) /
               data.shape[0] * 100))
print ('Percentage of No Seller: {:.2f}%'
       .format(len(data[data['INDEX New'] == 'No Seller']) /
               data.shape[0] * 100))
print('There are {} rows that are missing profile_picture'
      .format(len(data[pd.isnull(data.profile_picture)])))
print('There are {} rows that have picture_labels without pictures_url'
      .format(len(data[pd.isnull(data.pictures_url) &
                       pd.notnull(data.picture_labels)])))


The dataset has 35182 rows and 11 columns
Owner types are {nan, 'user', 'page'}
Target labels are {'Fake Seller', 'Reseller', 'No Seller'}
Percentage of Fake Seller: 26.08%
Percentage of Reseller: 27.24%
Percentage of No Seller: 46.69%
There are 0 rows that are missing profile_picture
There are 0 rows that have picture_labels without pictures_url


The original dataset contains the following fields:

  * description- mostly English product descriptions, although quite a few are in other languages.
  * found_keywords
  * found_keywords_occurrences
  * nb_like- number of likes
  * nb_share- number of shares
  * owner_type- user or page
  * pictures_url
  * picture_labels
  * INDEX New- the label; this is what we want to predict
  * profile_picture- link to the profile picture. All posts have a profile picture, and the links do not contain anything particularly useful.
  * published_at- time when the post was published on Facebook

In [2]:
display(data.head(n=5))


Unnamed: 0,description,found_keywords,found_keywords_occurrences,nb_like,nb_share,owner_type,pictures_url,picture_labels,INDEX New,profile_picture,published_at
0,An Alladin's cave of beautiful designer brands...,,0,0,0,user,,,No Seller,https://scontent.xx.fbcdn.net/v/t1.0-1/p50x50/...,02/23/16 04:48 AM
1,Everyone - let me take a minute to clarify som...,,0,29,0,user,https://scontent.xx.fbcdn.net/hphotos-xap1/v/t...,dog,No Seller,https://scontent.xx.fbcdn.net/hprofile-xlt1/v/...,04/02/2016
2,CHANEL QUILTED BACKPACK SMALL 23X26CM MQH WITH...,,0,0,0,user,https://scontent.xx.fbcdn.net/hphotos-xpa1/v/t...,vehicle,Fake Seller,https://scontent.xx.fbcdn.net/hprofile-xfl1/v/...,03/30/2016
3,,,0,25,0,user,https://scontent-ord1-1.xx.fbcdn.net/hphotos-x...,"handbag, handbag, fashion accessory, hood, clo...",Reseller,https://scontent-ord1-1.xx.fbcdn.net/hprofile-...,09/07/2015
4,Longchamp Zip Around Wallet PM / Whatsapp 012-...,,0,1,0,user,https://scontent-ord1-1.xx.fbcdn.net/hphotos-x...,electric blue,Reseller,https://scontent-ord1-1.xx.fbcdn.net/hprofile-...,12/05/2015


In [3]:
# separate target label from feature data which the model will be trained on
label = data['INDEX New']
features = data.drop('INDEX New', axis=1)


Many fields in the original dataset do not provide much insight on their own. The code below extracts meta features from these fields so that the model can better capture the characteristics of the data.

  * `published_hour` (int): hour of the day (0-23) at which the post was created; NaN when `published_at` has no time component
  * `description_length` (int): word count of `description`
  * `picture_label_occurrences` (int): how many picture labels does the post have?
  * `hashtags` (int): how many hashtags does the post have?
  * `punctuations` (int): how many exclamation marks does the post have?
  * `has_contact` (int 1 if True, 0 otherwise): did the author of the post mention any personal contact?
  * `uppercase_count` (int): number of uppercase characters in `description`, plus 1; 0 is reserved for a missing description
  * `uppercase_ratio` (float): ratio between `uppercase_count` and the character count of `description`
  * `has_pic_url` (int): as coded below, 1 if `pictures_url` is missing and 0 if the post has one, so the flag is inverted relative to its name
  

In [4]:
features['published_hour'] = features.published_at.apply(
    lambda x: datetime.strptime(x, '%m/%d/%y %I:%M %p').hour
    if len(x) > 10 else np.nan
)

features['description_length'] = features.description.apply(
    lambda x: len(x.split()) if isinstance(x, str) else 0
)

# add 1 to differentiate NaN from having no labels
features['picture_label_occurrences'] = features.picture_labels.apply(
    lambda x: x.count(',') + 1 if (not isinstance(x, float)) or
                                  (isinstance(x, float) and not np.isnan(x))
    else 0
)

features['hashtags'] = features.description.apply(
    lambda x: x.count('#') if (not isinstance(x, float)) or
                              (isinstance(x, float) and not np.isnan(x))
    else 0
)

features['punctuations'] = features.description.apply(
    lambda x: x.count('!') if (not isinstance(x, float)) or
                              (isinstance(x, float) and not np.isnan(x))
    else 0
)

phone = re.compile(r'\s[[0-9]{9,10}|[0-9]{3,4}\s[0-9]{6,7}]\s')
contact_flag = ['call', 'contact', '@', 'whatsapp',
                'text', 'message', 'pm', phone]
features['has_contact'] = features.description.apply(
    lambda x: int(any(bool(re.search(s, re.sub(r'[\-|\+|\(|\)|\.|\,]',
                                               '', x.lower())))
                      for s in contact_flag))
    if (not isinstance(x, float)) or (isinstance(x, float) and not np.isnan(x))
    else 0
)

# add 1 to differentiate NaN from having no uppercase letters
features['uppercase_count'] = features.description.apply(
    lambda x: (sum(1 for c in x if c.isupper())) + 1
    if not isinstance(x, float)
    else 0
)

# add 1 to differentiate NaN from having no uppercase letters
features['uppercase_ratio'] = features.description.apply(
    lambda x: (sum(1 for c in x if c.isupper()) + 1) / len(x)
    if not isinstance(x, float)
    else 0
)

features['has_pic_url'] = features.pictures_url.apply(
    lambda x: 0 if not isinstance(x, float) else 1
)

# remove profile_picture, pictures_url, and
# published_at since they are not informative
features = features.drop('profile_picture', axis=1) \
    .drop('pictures_url', axis=1) \
    .drop('published_at', axis=1)
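
In the `has_contact` cell, the `phone` pattern wraps its alternation in square brackets, which a regex engine reads as character classes rather than as a group. Below is a minimal sketch of a grouped variant, assuming the intent was "9-10 digits in a row, or 3-4 digits, a space, then 6-7 digits". The names `phone_grouped` and `mentions_phone` are hypothetical, the space padding is my own addition, and this variant did not produce the results reported in this notebook.

```python
import re

# Hypothetical grouped variant of the phone pattern: (?:...) makes the
# alternation explicit. Assumed intent: 9-10 digits in a row, or
# 3-4 digits, whitespace, then 6-7 digits, surrounded by whitespace.
phone_grouped = re.compile(r'\s(?:[0-9]{9,10}|[0-9]{3,4}\s[0-9]{6,7})\s')


def mentions_phone(text):
    """Return 1 if the cleaned, lowercased text has a phone-like number."""
    # same punctuation stripping as in the has_contact cell
    cleaned = re.sub(r'[\-|\+|\(|\)|\.|\,]', '', text.lower())
    # pad with spaces so a number at the start or end can still match
    return int(bool(phone_grouped.search(' ' + cleaned + ' ')))


print(mentions_phone('Whatsapp 012-3456789 for price'))  # phone-like number
print(mentions_phone('Only 3 left in stock'))            # no phone number
```

Swapping this into the feature extraction would change `has_contact` for some rows, so the summary statistics and accuracies in this write-up would need to be regenerated.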


Notice how skewed the distributions of the numeric features are. Originally, I applied a logarithmic transformation to reshape the distributions and make the model more robust to extreme values. However, the model performed considerably better without the transformation. This is probably because an extreme number of likes and shares is a rather normal phenomenon on Facebook, particularly for popular posts or official pages. Although these values are outliers, their extremity can be quite informative.
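
The logarithmic transformation itself is not shown in this notebook. As a rough sketch of what such a step could look like, assuming `np.log1p` (that is, log(1 + x)) applied to the count columns, here is a toy example. The `toy` frame and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# toy stand-in for the skewed count columns (values are made up)
toy = pd.DataFrame({'nb_like': [0, 4, 29, 672593],
                    'nb_share': [0, 0, 1, 1444213]})

# log1p computes log(1 + x), so zero counts stay at zero
log_toy = np.log1p(toy)

print(log_toy.round(2))
```

The transformation compresses the largest values heavily, which is exactly the information the paragraph above suggests is worth keeping for this dataset.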

In [5]:
display(features.describe())

Unnamed: 0,found_keywords_occurrences,nb_like,nb_share,published_hour,description_length,picture_label_occurrences,hashtags,punctuations,has_contact,uppercase_count,uppercase_ratio,has_pic_url
count,35182.0,35182.0,35182.0,34015.0,35182.0,35182.0,35182.0,35182.0,35182.0,35182.0,35182.0,35182.0
mean,0.301063,617.11233,437.5942,10.903307,30.385339,1.371667,0.36135,0.685748,0.193195,20.409556,0.129761,0.217981
std,1.443882,9420.10791,12891.64,6.944151,99.697575,5.586411,3.35685,2.566528,0.394811,56.386471,0.180818,0.41288
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,0.0,5.0,3.0,0.0,0.0,0.0,0.0,2.0,0.028855,0.0
50%,0.0,0.0,0.0,10.0,10.0,0.0,0.0,0.0,0.0,6.0,0.072993,0.0
75%,0.0,4.0,0.0,17.0,24.0,1.0,0.0,0.0,0.0,16.0,0.148148,0.0
max,94.0,672593.0,1444213.0,23.0,5414.0,89.0,245.0,77.0,1.0,2972.0,2.0,1.0


We want to fill in as many NaN values as possible. Therefore, I replaced NaNs in `published_hour` with the most frequent value in the column.

In [6]:
# impute nan values in published_hour
hour_imp = Imputer(missing_values='NaN', strategy='most_frequent', axis=1)
hour_imp.fit(features.published_hour.values.reshape(1, -1))
features.published_hour = sum(hour_imp.transform(
    features.published_hour.values.reshape(1, -1)).tolist(), [])
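
`Imputer` was removed from later scikit-learn releases, so the cell above may not run on a recent installation. A minimal pandas-only sketch of the same "most frequent value" fill is shown here on a toy series. The names `hours`, `most_frequent_hour`, and `filled` are hypothetical, and this is an alternative, not the code that produced the results in this notebook.

```python
import numpy as np
import pandas as pd

# toy hours with missing entries (values are made up)
hours = pd.Series([6.0, np.nan, 6.0, 17.0, np.nan, 4.0])

# mode() ignores NaN and returns the most frequent value(s) in sorted order;
# take the first one
most_frequent_hour = hours.mode().iloc[0]
filled = hours.fillna(most_frequent_hour)

print(filled.tolist())
```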


In the following code, I encoded all categorical variables. The `owner_type` column is now split into two columns, `user` and `page`, each containing 0s and 1s to indicate which owner type a post has. The labels are also encoded as integers.

Now the labels are

  * `Fake Seller`: 0
  * `No Seller`: 1
  * `Reseller`: 2
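
The integer codes follow from how `LabelEncoder` sorts the class names. A small sketch on toy labels (the list `toy_labels` is made up) shows the mapping and how to map codes back to strings:

```python
from sklearn.preprocessing import LabelEncoder

toy_labels = ['Reseller', 'No Seller', 'Fake Seller', 'No Seller']

encoder = LabelEncoder()
encoded = encoder.fit_transform(toy_labels)

# classes_ holds the sorted label names; position i is the integer code i
print(dict(zip(encoder.classes_, range(len(encoder.classes_)))))
print(encoded.tolist())
# inverse_transform maps integer codes back to the original strings
print(encoder.inverse_transform([0, 1, 2]).tolist())
```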

In [7]:
# one hot encode categorical variable
encoded_owner = pd.get_dummies(features.owner_type)
features = pd.concat([features, encoded_owner], axis=1)

# drop uninformative features
features = features.drop('owner_type', axis=1) \
    .drop('found_keywords', axis=1) \
    .drop('picture_labels', axis=1)

# change label categories from string to integer
LE = LabelEncoder()
label = LE.fit_transform(label)


Here is what the feature data looks like now.

In [8]:
display(features.head(n=5))


Unnamed: 0,description,found_keywords_occurrences,nb_like,nb_share,published_hour,description_length,picture_label_occurrences,hashtags,punctuations,has_contact,uppercase_count,uppercase_ratio,has_pic_url,page,user
0,An Alladin's cave of beautiful designer brands...,0,0,0,4.0,33,0,0,1,0,8,0.037915,1,0,1
1,Everyone - let me take a minute to clarify som...,0,29,0,6.0,111,1,0,0,0,20,0.032206,0,0,1
2,CHANEL QUILTED BACKPACK SMALL 23X26CM MQH WITH...,0,0,0,6.0,9,1,0,0,0,40,0.714286,0,0,1
3,,0,25,0,6.0,0,5,0,0,0,0,0.0,0,0,1
4,Longchamp Zip Around Wallet PM / Whatsapp 012-...,0,1,0,6.0,8,1,0,0,1,8,0.150943,0,0,1


---
## Natural Language Processing: Extracting Text Features

In the following section, I will illustrate how I extracted useful features from the `description` field.

For the purpose of demonstration, I randomly split the data into training (80%) and test (20%) datasets. This is not how I actually validated my model.

In [9]:
X_train, X_test, y_train, y_test = train_test_split(
    features, label, test_size=0.2, random_state=0)

# Show the results of the split
print ("Training set has {} samples.".format(X_train.shape[0]))
print ("Testing set has {} samples.".format(X_test.shape[0]))


Training set has 28145 samples.
Testing set has 7037 samples.


Then, from the training set, I created a `TfidfVectorizer` object to extract tf-idf (term frequency–inverse document frequency) values, which quantify the importance of each word in the description field. In short, this feature numerically represents how often a given word appears in a description, but downweights it by how often the word appears across all descriptions. Therefore, uninformative words that appear in many descriptions receive a relatively low score.

After fitting `TfidfVectorizer` on the training description text, I applied the same object to the descriptions in the test data, extracting tf-idf values only for words that appear in the training set.
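
To make the fit-on-train, transform-on-test idea concrete, here is a toy sketch with default `TfidfVectorizer` settings (not the n-gram and preprocessing settings used in this notebook). The documents are made up.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_docs = ['chanel bag for sale', 'bag for sale', 'lovely day at the beach']
test_docs = ['chanel wallet for sale']

vectorizer = TfidfVectorizer()
tfidf_train = vectorizer.fit_transform(train_docs)
# the test set reuses the vocabulary and idf weights learned from training
tfidf_test = vectorizer.transform(test_docs)

vocab = vectorizer.vocabulary_
# 'wallet' never appeared in training, so it has no column
print('wallet' in vocab)
# in the first document, the rarer 'chanel' outweighs the more common 'sale'
row = tfidf_train[0].toarray().ravel()
print(row[vocab['chanel']] > row[vocab['sale']])
```

Fitting only on the training descriptions keeps information from the test set out of the vocabulary and the idf weights.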

In [10]:
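# build the stemmer once rather than once per word
stemmer = SnowballStemmer("english")


def preprocessor(text):
    # remove '#', '|', '!', '-', '+', ':', '/' and apostrophes
    processed = re.sub(r'[#|\!|\-|\+|:|//|\']', "", text)
    # replace numbers (with optional commas or decimals) by a space
    processed = re.sub(r'(?:(?:\d+,?)+(?:\.?\d+)?)', ' ', processed).strip()
    # collapse runs of whitespace into a single space
    processed = re.sub(r'\s+', ' ', processed).strip()
    # stem each remaining word
    processed = " ".join(stemmer.stem(word) for word in processed.split())
    return processed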


description_vectorizer = TfidfVectorizer(sublinear_tf=True,
                                         ngram_range=(1, 2),
                                         preprocessor=preprocessor)
# fit on description in the training set
description_vectorizer.fit(X_train.description.values.astype('U'))
tfidf_train = description_vectorizer.transform(
    X_train.description.values.astype('U'))

# extract tf-idf from the test set
tfidf_test = description_vectorizer.transform(
    X_test.description.values.astype('U'))


In [11]:
print ("The sparse document matrix has {} tf-idf features"
       .format(tfidf_train.shape[1]))

The sparse document matrix has 343080 tf-idf features


Because there are far too many features, which would eventually make the model overfit the training data, I tried to extract only the _actually_ informative ones. To do so, I implemented Latent Semantic Analysis (LSA) by applying Singular Value Decomposition (SVD) to reduce the dimension of the matrix. The resulting tf-idf matrix has 150 components.
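
As a toy illustration of this reduction, the sketch below runs `TruncatedSVD` on a tiny tf-idf matrix and keeps 2 components instead of the 150 used in this notebook. The documents are made up.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ['chanel bag for sale', 'chanel wallet for sale',
        'gucci bag for sale', 'lovely day at the beach',
        'sunny day at the beach']

# one column per distinct word
tfidf = TfidfVectorizer().fit_transform(docs)

# keep 2 latent components instead of one column per word
lsa = TruncatedSVD(n_components=2, random_state=0)
reduced = lsa.fit_transform(tfidf)

print(tfidf.shape, '->', reduced.shape)
```

Each row is now a dense 2-dimensional vector, and documents that share vocabulary tend to end up close to each other in the reduced space.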

In hindsight, this model could have had much cleaner data if I had handled foreign-language text better. Although most posts are in English, quite a few are not. Ideally, I would have translated the foreign text to English before the tf-idf analysis, which would reduce the number of unique words in the data and capture significant words in non-English posts. I looked for a free API to do this, but because Google has stopped giving free access to its API, most Python modules for this have died out.


In [12]:
# dimension reduction: LSA
# fix the seed so that the decomposition is reproducible
tfidf_lsa = TruncatedSVD(n_components=150, random_state=0)
# fit TruncatedSVD on the training data.
reduced_tfidf_train = tfidf_lsa.fit_transform(tfidf_train)
# reduce dimension of the test data
reduced_tfidf_test = tfidf_lsa.transform(tfidf_test)


Time to merge the numerical features with the text features. I also scaled the data so that value ranges would not differ too much across features.

Originally, I also tried PCA on non-text features, but the model ended up having much better results without PCA. 

In [13]:
# no need to keep description anymore. Remove!
decomposed_X_train = X_train.drop('description', axis=1)
decomposed_X_test = X_test.drop('description', axis=1)

# combine text features with the rest.
feature_train = np.hstack((decomposed_X_train, reduced_tfidf_train))
feature_test = np.hstack((decomposed_X_test, reduced_tfidf_test))

# initialize a normalizer, then apply it to the features
feature_scaler = MinMaxScaler()
feature_scaler.fit(feature_train)
feature_train = feature_scaler.transform(feature_train)
feature_test = feature_scaler.transform(feature_test)
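
A small sketch of what fitting the scaler on the training data only implies, using made-up numbers: test values outside the training range are scaled with the training minimum and maximum, so they can land outside [0, 1].

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

train = np.array([[0.0, 10.0], [50.0, 20.0], [100.0, 30.0]])
test = np.array([[150.0, 15.0]])

scaler = MinMaxScaler()
# min and max are learned from the training rows only
scaled_train = scaler.fit_transform(train)
# the test row is scaled with the training min and max
scaled_test = scaler.transform(test)

print(scaled_train)
print(scaled_test)
```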


---
## Modeling

Time for modeling! I put the entire section of text feature extraction into a function so that I can easily loop through different folds during the validation process.

In [9]:
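# build the stemmer once rather than once per word
stemmer = SnowballStemmer("english")


def preprocessor(text):
    # remove '#', '|', '!', '-', '+', ':', '/' and apostrophes
    processed = re.sub(r'[#|\!|\-|\+|:|//|\']', "", text)
    # replace numbers (with optional commas or decimals) by a space
    processed = re.sub(r'(?:(?:\d+,?)+(?:\.?\d+)?)', ' ', processed).strip()
    # collapse runs of whitespace into a single space
    processed = re.sub(r'\s+', ' ', processed).strip()
    # stem each remaining word
    processed = " ".join(stemmer.stem(word) for word in processed.split())
    return processed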


def extract_feature(train_idx, test_idx, feature_df):
    X_train = feature_df.iloc[train_idx]
    X_test = feature_df.iloc[test_idx]

    # tf-idf
    description_vectorizer = TfidfVectorizer(sublinear_tf=True,
                                             ngram_range=(1, 2),
                                             preprocessor=preprocessor)
    description_vectorizer.fit(X_train.description.values.astype('U'))
    tfidf_train = description_vectorizer.transform(
        X_train.description.values.astype('U'))
    tfidf_test = description_vectorizer.transform(
        X_test.description.values.astype('U'))

    # LSA
    tfidf_lsa = TruncatedSVD(n_components=150, random_state=0)
    reduced_tfidf_train = tfidf_lsa.fit_transform(tfidf_train)
    reduced_tfidf_test = tfidf_lsa.transform(tfidf_test)

    # combine text and non-text features
    feature_train = np.hstack((X_train.drop('description', axis=1),
                              reduced_tfidf_train))
    feature_test = np.hstack((X_test.drop('description', axis=1),
                             reduced_tfidf_test))

    # scale features
    feature_scaler = MinMaxScaler()
    feature_scaler.fit(feature_train)
    feature_train = feature_scaler.transform(feature_train)
    feature_test = feature_scaler.transform(feature_test)

    return feature_train, feature_test



For validation, I used k-fold cross-validation with 5 folds. During testing, I noticed that it is essential to _shuffle_ the data before splitting it into folds. It turned out that there are multiple posts with the same description or very similar advertisements. If you don't shuffle the data, the training and test sets each receive clusters of very similar posts and end up with a skewed representation of particular brands.
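
The effect of shuffling can be seen on a toy index. In this sketch (ten made-up rows, where neighbouring rows stand in for near-duplicate posts stored side by side), the unshuffled folds are contiguous blocks, while the shuffled folds draw rows from across the dataset.

```python
import numpy as np
from sklearn.model_selection import KFold

# ten rows; imagine neighbouring rows are near-duplicate posts
rows = np.arange(10)

# without shuffling, each test fold is a contiguous block of rows
unshuffled = [test.tolist() for _, test in KFold(n_splits=5).split(rows)]

# with shuffling, rows are permuted before being assigned to folds
shuffled = [test.tolist()
            for _, test in KFold(n_splits=5, shuffle=True,
                                 random_state=0).split(rows)]

print(unshuffled)
print(shuffled)
```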

In [10]:
def validate_model(clf_model, features, label):
    """
        clf_model- model object
        features- original feature dataset
        label- original label dataset
    """
    model_result = []
    kf = KFold(n_splits=5, shuffle=True, random_state=0)
    i = 0
    for train, test in kf.split(features.values):
        i = i + 1
        print ('Fold {}'.format(i))
        train_final, test_final = extract_feature(train, test, features)
        clf_model.fit(train_final, label[train])
        output = clf_model.score(test_final, label[test])
        model_result.append(output)
        # accuracy of individual folds
        print ('Accuracy:{}'.format(output))

    print ('Overall Accuracy: {}'.format(np.mean(model_result)))



I tried several different classifiers: Decision Tree, Random Forest, Support Vector Machine, Multi-layer Perceptron, AdaBoost, Gradient Boosting, and eXtreme Gradient Boosting (XGBoost). After many rounds of parameter tuning, I found the following classifiers to be the most accurate:

  * Random Forest Classifier
      > Because the training data has quite a few features (165 in total) and the model has a high risk of overfitting on particular text features, I used 100 trees (estimators) with no maximum depth. For the maximum number of features considered at each split, I followed the rule of thumb and took the square root of the number of input features.
      
  * Gradient Boosting with Decision Tree Regressor
      > I picked 0.1 as the learning rate so that the model generalizes better. In return, I used 150 boosting estimators (trees) so that the model can still capture the relevant relationships at such a low learning rate. I limited the maximum depth to 10 to prevent overfitting as well. Finally, subsampling is set to 0.8 to reduce variance.
      
  * AdaBoost with Random Forest as a base estimator
      > Instead of a decision tree, I used a random forest classifier as the base learner because it seemed to generalize much better. But because it was so computationally expensive, I set max_depth to 5 (as opposed to none above). I wanted a low learning rate, so I ended up using 300 estimators.
      
  * XGBoost
      > As with the algorithms above, I set the learning rate to 0.1 to prevent overfitting and the number of boosting rounds to 300 to compensate for the low learning rate. I subsampled 80% of the data, also to prevent overfitting. The objective is set to multi:softmax to allow multi-class classification.

They all reached similar accuracies, but XGBoost seemed to be the most practical. Not only is it slightly more accurate than the other three, but it is also much more efficient than AdaBoost and Gradient Boosting. AdaBoost and Gradient Boosting take ~30 minutes to train the model (thus taking around 1.5h to test all 5 folds), and therefore, all things considered, XGBoost had the best overall performance.

### Random Forest

In [13]:
clf = RandomForestClassifier(n_estimators=100, max_depth=None,
                             min_samples_split=2, random_state=0)
validate_model(clf, features, label)


Fold 1
Accuracy:0.8557623987494671
Fold 2
Accuracy:0.8468097200511582
Fold 3
Accuracy:0.8473564525298465
Fold 4
Accuracy:0.8480670835702103
Fold 5
Accuracy:0.8521887436043206
Overall Accuracy: 0.8500368797010005


### Gradient Boosting

In [15]:
clf = GradientBoostingClassifier(n_estimators=150, learning_rate=0.1,
                                 subsample=0.8, max_depth=10,
                                 max_features='auto', random_state=0)
validate_model(clf, features, label)


Fold 1
Accuracy:0.8577518829046469
Fold 2
Accuracy:0.8584624129600682
Fold 3
Accuracy:0.8590108015918135
Fold 4
Accuracy:0.8563104036384309
Fold 5
Accuracy:0.8524729960204662
Overall Accuracy: 0.8568016994230853


### AdaBoost

In [18]:
clf = AdaBoostClassifier(base_estimator=RandomForestClassifier(max_depth=5),
                         n_estimators=300, learning_rate=0.1, random_state=0)
validate_model(clf, features, label)


Fold 1
Accuracy:0.8448202358959784
Fold 2
Accuracy:0.834588603097911
Fold 3
Accuracy:0.8344229675952246
Fold 4
Accuracy:0.8374076179647527
Fold 5
Accuracy:0.8352757248436612
Overall Accuracy: 0.8373030298795057


### XGBoost

In [24]:
# list to store results
model_result = []
predictions = []
actuals = []

# create validation set
kf = KFold(n_splits=5, shuffle=True, random_state=0)
i = 0
for train, test in kf.split(features.values):
    i = i + 1
    print ('Fold {}'.format(i))
    train_final, test_final = extract_feature(train, test, features)
    
    # collect true labels from each fold
    actuals = actuals + label[test].tolist()

    # prepare the dataset for xgboost
    dtrain = xgb.DMatrix(train_final, label=label[train])
    dtest = xgb.DMatrix(test_final, label=label[test])
    
    # parameter setting
    param = {'max_depth': 7, 'eta': 0.1, 'objective': 'multi:softmax',
             'num_class': 3, 'subsample': 0.8, 'seed': 0}
    num_round = 300
    clf_xgb = xgb.train(param, dtrain, num_round)
    
    # make prediction and store it to predictions list
    preds = clf_xgb.predict(dtest)
    predictions = predictions + preds.tolist()

    # accuracy score
    output = metrics.accuracy_score(label[test], preds)
    model_result.append(output)
    # accuracy of individual folds
    print ('Accuracy:{}'.format(output))

print ('Overall Accuracy: {}'.format(np.mean(model_result)))


Fold 1
Accuracy:0.8621571692482592
Fold 2
Accuracy:0.8613045331817536
Fold 3
Accuracy:0.8560261512222854
Fold 4
Accuracy:0.8615690733371234
Fold 5
Accuracy:0.8598635588402501
Overall Accuracy: 0.8601840971659342


## XGBoost Model Performance

In [36]:
print(metrics.classification_report(actuals, predictions, target_names = LE.classes_.tolist()))


             precision    recall  f1-score   support

Fake Seller       0.87      0.81      0.84      9174
  No Seller       0.87      0.90      0.89     16425
   Reseller       0.83      0.84      0.83      9583

avg / total       0.86      0.86      0.86     35182
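
For readers less familiar with the report above, precision and recall can be read off a confusion matrix. The sketch below uses made-up toy labels, not this notebook's predictions, to show how the two numbers are derived for one class.

```python
from sklearn import metrics

# toy labels, for illustration only
y_true = [0, 0, 1, 1, 1, 2]
y_pred = [0, 1, 1, 1, 2, 2]

cm = metrics.confusion_matrix(y_true, y_pred)
# rows are true classes, columns are predicted classes
print(cm)

# recall for class 1: correct class-1 predictions / all rows truly class 1
recall_1 = cm[1, 1] / cm[1].sum()
# precision for class 1: correct class-1 predictions / all predicted class 1
precision_1 = cm[1, 1] / cm[:, 1].sum()
print(recall_1, precision_1)
```

Running `metrics.confusion_matrix(actuals, predictions)` on the lists collected in the XGBoost cell would give the same kind of breakdown for the three seller classes.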

