Carpooling is a very popular means of long distance transportation for students in Waterloo. However, searching for carpools in the numerous Waterloo Carpool Facebook groups is tedious. This is because all searches must be performed via "Ctrl f", all groups must be searched individually, and since only a limited number of posts appear on the page until you scroll down. Thus, we have designed a website to search the following Facebook groups for carpools:

## Search Parameters

It would be ideal to be able to search for carpool posts based on whether you are offering or searching for a carpool and the origin, destination and date of the carpool. So we need to develop a method for extracting this information from a post.

## Post, Trips, Groups:

Every post in a Facbook group is referred to as a post and is assigned a unique ID, post_id. A trip is a subset of a post and is determined by a single origin, a single destination, a time, a date and a trinary classification of whether the post is a driving, searching or other post. A driving post is if someone is offering a carpool, a searching post is if someone is searching for a carpool and an other post is an unrelated post. For example, the following post: "Driving from Waterloo to Toronto at 8:00am and back at 8:00pm today" contains two trips (Driving from Waterloo to Toronto at 8:00am today & Driving from Toronto to Waterloo at 8:00pm today). Users are given the ability to search for certain trips, and all posts containing those trips are displayed. Each trip is assigned a unique id, trip_id. Only trips from the same post will have the same post_id. These concepts are not enough, however. Since we are extracting information from different Facebook groups, it is possible that numerous groups could both contain the exact same post. When read into the database, this post will appear more than once with different post_ids each time. To deal with this, we introduce groups. A group is a collection of all posts deemed to be duplicates. Each group has a unique group_id. It follows that only posts that are duplicates will have the same group_id and only trips from posts that are duplicates will have the same group_id. Upon searching for trips, having groups allows us to gather duplicates from and display them appropriately.

## Driving, Searching and Other Classification

We need to discover whether a post is a "driving" post (a carpool is being offered), a "searching" post (a carpool is being requested) or an "other" post (a completetly unrelated post). Discovering whether a post is a driving, searching or an other post is a trinary classification problem, and so we can use a supervised learning algorithm.

To do this, however, we need some training data. So we manually classified 1366 posts as driving, searching or other posts. Below is a preview of the data where post_type is the classification and stage_3 is a feature, the processed text.

In [1]:
import pandas as pd
from classify.db import engine
derived_posts = pd.read_sql_query('SELECT * from derived_posts', con=engine)

In [2]:
data = pd.read_sql_query('SELECT A.post_id, A.post_type, A.route_count, B.stage_3 FROM manual_posts A LEFT JOIN derived_posts B ON (A.post_id = B.post_id) ', con=engine)
data[["post_type", "stage_3"]].head()

Unnamed: 0,post_type,stage_3
0,d,"driving,waterloo,to,markham,sunday,may,27,at,3..."
1,d,"driving,toronto,to,waterloo,sunday,may,27th,at..."
2,d,"leaving,waterloo,to,toronto,7pm,today"
3,d,"driving,scarborough,to,waterloo,at,8am,monday,..."
4,d,"driving,fairview,mall,to,waterloo,sunday,may,2..."


In [3]:
import nltk
import numpy
import string
import re
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold,cross_val_score
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import precision_recall_fscore_support as score
from sklearn.model_selection import train_test_split

def split_by_comma(x):
    return x.split(",")

Currently, stage_3 is the only feature. The problem, however, is that stage_3 is a string and cannot be used directly. So we proceed by vectorizing the text. We'll test the TFID, count and 2-gram vectorizers to see which one yields better results. 

Furthermore, we are going proceed by fitting a random forest model. To choose the hyper parameters we will perform a grid search over the number of estimators as 50,150,300 and the max depth as 60,90, None. The grid search parameters are chosen to given us an idea on how the model results change as we change the parameters.

In [4]:
rf = RandomForestClassifier()
param = {'n_estimators': [50, 150,300], 'max_depth': [60,90,None]}
gs = GridSearchCV(rf, param, cv = 5, n_jobs = -1, return_train_score= True)

Using the TFIDF vectorizer:

In [5]:
tfidf_vect = TfidfVectorizer(analyzer = split_by_comma)
X_tfidf = tfidf_vect.fit_transform(data["stage_3"])
X_tfidf_feat = pd.DataFrame(X_tfidf.toarray())
X_tfidf_feat.columns = tfidf_vect.get_feature_names()

Below is a preview of the TFIDF vectorizer features

In [6]:
X_tfidf_feat.head()

Unnamed: 0,$10,$15,$150,$20,$20pm,$30,$40,$50,$75,00am,...,work,workspace,world,would,writing,wu,www,year,york,young
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.311852,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Below are the results of the grid search using the TFIDF vectorizer features.

In [7]:
gs_fit_tfidf = gs.fit(X_tfidf_feat, data["post_type"])
pd.DataFrame(gs_fit_tfidf.cv_results_)

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_max_depth,param_n_estimators,params,split0_test_score,split1_test_score,split2_test_score,...,mean_test_score,std_test_score,rank_test_score,split0_train_score,split1_train_score,split2_train_score,split3_train_score,split4_train_score,mean_train_score,std_train_score
0,0.186291,0.030523,0.007586,0.001841,60.0,50,"{'max_depth': 60, 'n_estimators': 50}",0.948905,0.963504,0.970803,...,0.965593,0.009638,9,1.0,1.0,1.0,1.0,1.0,1.0,0.0
1,0.572471,0.04415,0.017952,0.003025,60.0,150,"{'max_depth': 60, 'n_estimators': 150}",0.956204,0.974453,0.974453,...,0.97511,0.010638,1,1.0,1.0,1.0,1.0,1.0,1.0,0.0
2,1.093277,0.044281,0.031716,0.001323,60.0,300,"{'max_depth': 60, 'n_estimators': 300}",0.959854,0.974453,0.970803,...,0.974378,0.009477,2,1.0,1.0,1.0,1.0,1.0,1.0,0.0
3,0.191089,0.004487,0.006582,0.000797,90.0,50,"{'max_depth': 90, 'n_estimators': 50}",0.945255,0.974453,0.978102,...,0.972914,0.014686,7,1.0,1.0,1.0,1.0,1.0,1.0,0.0
4,0.515024,0.023774,0.016955,0.00141,90.0,150,"{'max_depth': 90, 'n_estimators': 150}",0.959854,0.970803,0.970803,...,0.972914,0.008459,7,1.0,1.0,1.0,1.0,1.0,1.0,0.0
5,1.047202,0.029132,0.031515,0.005698,90.0,300,"{'max_depth': 90, 'n_estimators': 300}",0.959854,0.974453,0.970803,...,0.973646,0.010661,5,1.0,1.0,1.0,1.0,1.0,1.0,0.0
6,0.189694,0.007339,0.006582,0.000489,,50,"{'max_depth': None, 'n_estimators': 50}",0.959854,0.974453,0.970803,...,0.974378,0.009477,2,1.0,1.0,1.0,1.0,1.0,1.0,0.0
7,0.588228,0.042338,0.020944,0.003888,,150,"{'max_depth': None, 'n_estimators': 150}",0.963504,0.974453,0.970803,...,0.974378,0.008288,2,1.0,1.0,1.0,1.0,1.0,1.0,0.0
8,1.398865,0.197239,0.033936,0.005029,,300,"{'max_depth': None, 'n_estimators': 300}",0.963504,0.974453,0.967153,...,0.973646,0.008721,5,1.0,1.0,1.0,1.0,1.0,1.0,0.0


Using the count cectorizer:

In [8]:
count_vect = CountVectorizer(analyzer = split_by_comma)
X_count = count_vect.fit_transform(data["stage_3"])
X_count_feat= pd.DataFrame(X_count.toarray())
X_count_feat.columns = count_vect.get_feature_names()

Below is a preview of the count vectorizer features.

In [9]:
X_count_feat.head()

Unnamed: 0,$10,$15,$150,$20,$20pm,$30,$40,$50,$75,00am,...,work,workspace,world,would,writing,wu,www,year,york,young
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Below are the results of the grid search using the count vectorizer features.

In [10]:
gs_fit_count = gs.fit(X_count_feat, data["post_type"])
pd.DataFrame(gs_fit_count.cv_results_)

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_max_depth,param_n_estimators,params,split0_test_score,split1_test_score,split2_test_score,...,mean_test_score,std_test_score,rank_test_score,split0_train_score,split1_train_score,split2_train_score,split3_train_score,split4_train_score,mean_train_score,std_train_score
0,0.220907,0.053369,0.011072,0.005518,60.0,50,"{'max_depth': 60, 'n_estimators': 50}",0.959854,0.956204,0.978102,...,0.969253,0.00936,9,1.0,1.0,1.0,1.0,1.0,1.0,0.0
1,0.785808,0.052026,0.024754,0.003167,60.0,150,"{'max_depth': 60, 'n_estimators': 150}",0.967153,0.970803,0.970803,...,0.973646,0.005294,5,1.0,1.0,1.0,1.0,1.0,1.0,0.0
2,1.431239,0.031272,0.051969,0.005537,60.0,300,"{'max_depth': 60, 'n_estimators': 300}",0.967153,0.963504,0.970803,...,0.972914,0.007802,7,1.0,1.0,1.0,1.0,1.0,1.0,0.0
3,0.254686,0.027499,0.010693,0.002721,90.0,50,"{'max_depth': 90, 'n_estimators': 50}",0.967153,0.963504,0.970803,...,0.973646,0.008389,5,1.0,1.0,1.0,0.999086,1.0,0.999817,0.000366
4,0.715884,0.025674,0.030915,0.00929,90.0,150,"{'max_depth': 90, 'n_estimators': 150}",0.974453,0.963504,0.974453,...,0.976574,0.00847,2,1.0,1.0,1.0,1.0,1.0,1.0,0.0
5,1.630293,0.159672,0.058643,0.020277,90.0,300,"{'max_depth': 90, 'n_estimators': 300}",0.970803,0.970803,0.974453,...,0.978038,0.007607,1,1.0,1.0,1.0,1.0,1.0,1.0,0.0
6,0.316553,0.035573,0.011171,0.003858,,50,"{'max_depth': None, 'n_estimators': 50}",0.967153,0.970803,0.970803,...,0.974378,0.006039,4,1.0,1.0,1.0,1.0,1.0,1.0,0.0
7,0.788559,0.043687,0.025175,0.003059,,150,"{'max_depth': None, 'n_estimators': 150}",0.963504,0.963504,0.974453,...,0.972914,0.008134,7,1.0,1.0,1.0,1.0,1.0,1.0,0.0
8,1.615574,0.182188,0.049573,0.017398,,300,"{'max_depth': None, 'n_estimators': 300}",0.970803,0.967153,0.978102,...,0.975842,0.006319,3,1.0,1.0,1.0,1.0,1.0,1.0,0.0


Using the 2-Gram vectorizer:

In [11]:
#putting tockenized sentence back into a sentence with spaces for ngrams
data_ngram = data["stage_3"].apply(lambda x: x.replace(","," "))

gram2_vect = CountVectorizer(ngram_range = (2,2))
X_gram2 = gram2_vect.fit_transform(data_ngram)
X_gram2_feat= pd.DataFrame(X_gram2.toarray())
X_gram2_feat.columns = gram2_vect.get_feature_names()

Below is a preview of the 2-gram vectorizer features.

In [12]:
X_gram2_feat.head()

Unnamed: 0,00 25,00 27,00 door,00 message,00 northyork,00 thursday,00 today,00am 20,00am 24,00am back,...,york scarborough,york station,york to,york today,york tomorrow,york tonight,york toronto,york waterloo,york york,young clinton
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Below are the results of the grid search using the 2-gram vectorizer features.

In [13]:
gs_fit_gram2 = gs.fit(X_gram2_feat, data["post_type"])
pd.DataFrame(gs_fit_gram2.cv_results_)

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_max_depth,param_n_estimators,params,split0_test_score,split1_test_score,split2_test_score,...,mean_test_score,std_test_score,rank_test_score,split0_train_score,split1_train_score,split2_train_score,split3_train_score,split4_train_score,mean_train_score,std_train_score
0,0.958746,0.282798,0.021765,0.00412,60.0,50,"{'max_depth': 60, 'n_estimators': 50}",0.945255,0.952555,0.934307,...,0.953148,0.01283,4,0.997253,0.998168,0.999084,0.997258,0.997258,0.997804,0.000731
1,4.047605,0.280491,0.056336,0.003915,60.0,150,"{'max_depth': 60, 'n_estimators': 150}",0.934307,0.948905,0.930657,...,0.95022,0.016636,7,0.999084,0.999084,1.0,0.999086,0.998172,0.999085,0.000578
2,8.638615,0.155951,0.116456,0.008277,60.0,300,"{'max_depth': 60, 'n_estimators': 300}",0.934307,0.952555,0.934307,...,0.952416,0.01622,6,0.999084,0.999084,1.0,0.999086,0.999086,0.999268,0.000366
3,1.564484,0.041508,0.028568,0.00362,90.0,50,"{'max_depth': 90, 'n_estimators': 50}",0.941606,0.952555,0.941606,...,0.956076,0.013939,2,1.0,1.0,1.0,1.0,1.0,1.0,0.0
4,4.665439,0.372282,0.073367,0.013434,90.0,150,"{'max_depth': 90, 'n_estimators': 150}",0.937956,0.956204,0.937956,...,0.956076,0.016882,2,1.0,1.0,1.0,1.0,1.0,1.0,0.0
5,8.469072,0.498171,0.130874,0.016524,90.0,300,"{'max_depth': 90, 'n_estimators': 300}",0.934307,0.952555,0.934307,...,0.953148,0.017083,4,1.0,1.0,1.0,1.0,1.0,1.0,0.0
6,1.531452,0.127608,0.026161,0.007925,,50,"{'max_depth': None, 'n_estimators': 50}",0.948905,0.963504,0.934307,...,0.95754,0.014077,1,0.999084,1.0,1.0,1.0,1.0,0.999817,0.000366
7,3.984835,0.051413,0.057766,0.005021,,150,"{'max_depth': None, 'n_estimators': 150}",0.934307,0.959854,0.930657,...,0.95022,0.014799,7,1.0,1.0,1.0,1.0,1.0,1.0,0.0
8,6.865542,1.301235,0.087867,0.035483,,300,"{'max_depth': None, 'n_estimators': 300}",0.930657,0.952555,0.934307,...,0.95022,0.01616,7,1.0,1.0,1.0,1.0,1.0,1.0,0.0


Now since the plan is to classify posts on an hourly basis, the classification time is unimportant, so we ignore the score time. The mean test score score is essentially 97% for both the TFIDF and count vectorizers and essentially 95% for the 2-gram vectorizer. Since there is hardly any difference in the test score between different hyper parameters, we choose the simplest model which is the count vectorizer with 50 estimators and a max depth of 60. We could continue further with tuning the hyper parameters or try different models, however, the results are satisfactory so we stop here.

In [14]:
model = RandomForestClassifier(n_estimators = 60, max_depth = 10, n_jobs=-1).fit(X_count_feat,data["post_type"])

One of the nice things about a random forest model is that we can see the importance of every feature. The top 10 most important features of our chosen model (for the count vectorizer the features are individual words) are below.

In [15]:
[x[1] for x in sorted(zip(model.feature_importances_, count_vect.get_feature_names()), reverse = True)[0:10]]

['looking',
 'ride',
 'driving',
 '$15',
 'waterloo',
 'at',
 'from',
 'anytime',
 'anyone',
 'text']

The words "looking" and "driving" are the most important, which is no surprise. However, "$15" is also an important feature. This is probably because people offering rides are usually the ones who name prices.

## Route Detection:

The words "to" and "from" are words commonly used in carpooling posts, or so I suspect. Lets see the percentage of posts which contain the word "to" or the word "from" from our batch of posts.

In [16]:
sum([1 for x in derived_posts["stage_3"].tolist() if ("to" or "from") in split_by_comma(x)])/derived_posts.shape[0]

0.9692338694560042

More than 95% of the posts contain the words "to" and "from". And so we can use the relative location of cities in the post, in terms of the words "to" and "from", to determine the origins and destinations of the trips in the post. For posts with 1 trip, or posts with multiple trips in which the words “to” and/or “from” only appear once (for example “Driving from Waterloo to Mississauga then Toronto”) this method is almost flawless. However, for other posts, this method only successfully finds the first trip. Approximately 700 if the previously classified posts have been manually counted for their number of routes. So let see the percentage of the counted posts that have 1 trip are

In [17]:
sum([1 for x in data.dropna()["route_count"] if x == 1])/data.dropna().shape[0]

0.8349120433017592

So using our sample, we conclude that approximately 83% of posts have 1 trip. Due to the high percentage this method of route detection is viable.