# Problem #1: Machine Learning and Text Analysis (suggested time: 60-90m)

## Table of Contents
1. Read in and Explore Data
2. Build Classifier (initial model)
3. Measure Results (initial model)
4. Predict group membership for unlabeled tweets **[Task A]**
5. Understand content of posts between Group 1 and Group 2 **[Task B]**
6. Summary of Project + Findings

## 1. Read in Data

In [3]:
# Import libraries
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings("ignore")

df = pd.read_csv("data.csv", sep = ',')

In [4]:
df.head(10)

Unnamed: 0,text,label
0,#Job #Boston Site Supervisor / Lead Carpenter:...,
1,RT @wilw NBC reporting suspect alive and in cu...,1.0
2,#BostonMararthon suspect Dzhokhar Tsarnaev is ...,1.0
3,THANK YOU ?MT @Boston_Police: CAPTURED! The hu...,
4,RT @CNBCClosingBell The Boston College Center ...,
5,Diversity: BU President To Testify Before The ...,
6,RT @BreitbartNews Margaret Thatcher Remembered...,1.0
7,@Kid_Ink \nNEW MUSIC @DubbZeroBFMI ft @TroyAve...,
8,@barstoolsports ?@barstoolsports: Some tremend...,
9,@itsYONAS thanks for the music man had this sh...,


I notice some NaN values in the `label` column. Let's see how many rows are affected across the entire dataset.

### Explore Data, Clean Data

In [5]:
df.isna().sum()

text         0
label    12218
dtype: int64

In [6]:
# It seems we have missing LABEL data. What fraction of rows is labeled?
# Note: count() skips NaN, so this is the labeled share, not the missing share.
df['label'].count() / len(df['label'])

0.24663953631767171

In [7]:
# Keep a separate DataFrame of the rows with no label; we will need it later (prediction task).
# .copy() makes it independent of df, so adding predicted labels later is safe.
missing_vals_df = df[df.isnull().any(axis=1)].copy()

In [8]:
# Now that we've stored the missing rows into a separate dataframe, 
# we can drop them from our initial model
df = df.dropna()

In [9]:
# Check to see if classes are balanced. Unbalanced classes will require some additional work.
df['label'].value_counts()

0.0    2000
1.0    2000
Name: label, dtype: int64

Cool, balanced!
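
As a quick sanity check, the counts reported so far can be tied together with plain arithmetic. The sketch below hard-codes the numbers from the outputs above (12,218 unlabeled rows and 2,000 rows per class) rather than re-reading `data.csv`; the variable names are illustrative only.

```python
# Counts copied from the notebook outputs above (not recomputed from data.csv).
n_unlabeled = 12218                      # from df.isna().sum()
class_counts = {0.0: 2000, 1.0: 2000}    # from df['label'].value_counts()

n_labeled = sum(class_counts.values())
n_total = n_labeled + n_unlabeled

# Share of rows with and without a label.
labeled_fraction = n_labeled / n_total
missing_fraction = n_unlabeled / n_total

# Share of each class among the labeled rows.
class_balance = {label: count / n_labeled for label, count in class_counts.items()}

print(f"labeled: {labeled_fraction:.1%}, missing: {missing_fraction:.1%}")
print(class_balance)
```

The labeled share matches the 0.2466 printed earlier, which confirms that the earlier cell reports the labeled fraction and that roughly three quarters of the rows need a predicted label.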

### Prepare Data for Modeling

In [12]:
X = df['text']
y = df['label']

In [13]:
from sklearn.model_selection import train_test_split

# Hold out a test set so we can measure performance on unseen data and detect overfitting.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.33, random_state = 7)

## 2. Build Classifier (initial model)

In [15]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

In [16]:
twitter_clf = Pipeline([('tfidf', TfidfVectorizer()), 
                        ('clf', LinearSVC())])

In [17]:
twitter_clf.fit(X_train, y_train)

Pipeline(memory=None,
         steps=[('tfidf',
                 TfidfVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.float64'>,
                                 encoding='utf-8', input='content',
                                 lowercase=True, max_df=1.0, max_features=None,
                                 min_df=1, ngram_range=(1, 1), norm='l2',
                                 preprocessor=None, smooth_idf=True,
                                 stop_words=None, strip_accents=None,
                                 sublinear_tf=False,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=None, use_idf=True,
                                 vocabulary=None)),
                ('clf',
                 LinearSVC(C=1.0, class_weight=None, dual=True,
                           fit_intercept=True, intercept_scaling=1,
               

In [18]:
predictions = twitter_clf.predict(X_test)

## 3. Measure Results (initial model)

In [19]:
from sklearn.metrics import confusion_matrix, accuracy_score

In [20]:
print(confusion_matrix(y_test, predictions))

[[668   3]
 [  2 647]]


In [21]:
print(accuracy_score(y_test, predictions))

0.9962121212121212


A 99.6% accuracy score is great! Given that our classes are balanced (50-50), accuracy is a solid evaluation criterion.

If our classes were imbalanced, we could have used precision and recall as our evaluation criteria instead.
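
To make the link between these metrics concrete, here is a minimal sketch that derives accuracy, precision, recall, and F1 directly from the confusion matrix printed above. It assumes scikit-learn's convention (rows are true labels, columns are predicted labels, classes sorted as 0 then 1) and treats label 1 as the positive class; the variable names are illustrative.

```python
# Confusion matrix copied from the output above.
# Rows = true label (0, 1), columns = predicted label (0, 1).
cm = [[668, 3],
      [2, 647]]

tn, fp = cm[0]
fn, tp = cm[1]
total = tn + fp + fn + tp

accuracy = (tp + tn) / total
precision = tp / (tp + fp)   # of tweets predicted as class 1, how many really are
recall = tp / (tp + fn)      # of true class-1 tweets, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} f1={f1:.4f}")
```

With only five misclassified tweets out of 1,320, all four numbers land very close together, which is what we expect when classes are balanced and errors are symmetric.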

## 4. Predict group membership for unlabeled tweets (Task A)

Use the existing model (99.6% accuracy on the hold-out set) to predict labels for the rows of the dataset that have none.

In [22]:
# bring back the set-aside df from earlier
missing_vals_df

Unnamed: 0,text,label
0,#Job #Boston Site Supervisor / Lead Carpenter:...,
3,THANK YOU ?MT @Boston_Police: CAPTURED! The hu...,
4,RT @CNBCClosingBell The Boston College Center ...,
5,Diversity: BU President To Testify Before The ...,
7,@Kid_Ink \nNEW MUSIC @DubbZeroBFMI ft @TroyAve...,
...,...,...
16211,@MzSexxyJas We could be looking at 3 feet of s...,
16213,"""@MsStacyThatsMe: @4evergraceJONES Lol We're C...",
16214,RT @BostonDotCom This is according to @AP : Su...,
16216,Thanks RT @laVisualiza: Boston is truly a beau...,


In [23]:
# prepare for modeling
missing_vals_X = missing_vals_df['text'] 
missing_vals_y = missing_vals_df['label']

In [24]:
missing_vals_predictions = twitter_clf.predict(missing_vals_X)

In [25]:
missing_vals_predictions

array([0., 1., 0., ..., 1., 0., 1.])

### Add missing value predictions

In [27]:
# predict() returns one label per row, in row order, so it can be assigned directly
missing_vals_df['label'] = missing_vals_predictions

In [28]:
missing_vals_df

Unnamed: 0,text,label
0,#Job #Boston Site Supervisor / Lead Carpenter:...,0.0
3,THANK YOU ?MT @Boston_Police: CAPTURED! The hu...,1.0
4,RT @CNBCClosingBell The Boston College Center ...,0.0
5,Diversity: BU President To Testify Before The ...,0.0
7,@Kid_Ink \nNEW MUSIC @DubbZeroBFMI ft @TroyAve...,0.0
...,...,...
16211,@MzSexxyJas We could be looking at 3 feet of s...,0.0
16213,"""@MsStacyThatsMe: @4evergraceJONES Lol We're C...",1.0
16214,RT @BostonDotCom This is according to @AP : Su...,1.0
16216,Thanks RT @laVisualiza: Boston is truly a beau...,0.0


### Ensure new labeled data is consistent with original labeled data


Task B asks us to compare the content of posts in Group 1 and Group 2. Before doing that, it is worth checking that the predicted labels are consistent with the original ones: the top words in the originally labeled data should resemble the top words in the newly labeled data for each group.

In [29]:
from sklearn.feature_extraction.text import CountVectorizer

# Create a function that returns the top n words and their frequency counts

def get_top_words(corpus, n=None):
    
    vec = CountVectorizer(stop_words = 'english').fit(corpus)
    bag_of_words = vec.transform(corpus)
    words_sum = bag_of_words.sum(axis=0) 
    
    words_freq = [(w, words_sum[0, idx]) for w, idx in vec.vocabulary_.items()]
    words_freq = sorted(words_freq, key = lambda x: x[1], reverse=True)
    
    return words_freq[:n]
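
For readers who want to see the counting logic without scikit-learn, here is a rough standard-library sketch of the same idea. It is not the function used in this notebook: `top_words_stdlib` and the tiny `STOP_WORDS` set are hypothetical stand-ins (scikit-learn's built-in `'english'` list is much longer), so counts on real data would differ somewhat. The token pattern is the default one shown in the `TfidfVectorizer` repr above, which `CountVectorizer` also uses by default.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"the", "and", "in", "to", "of", "is", "for", "on", "at"}

# Tokens are runs of two or more word characters.
TOKEN_RE = re.compile(r"(?u)\b\w\w+\b")

def top_words_stdlib(corpus, n=None, stop_words=STOP_WORDS):
    """Return the n most frequent non-stop-word tokens as (word, count) pairs."""
    counts = Counter()
    for doc in corpus:
        tokens = TOKEN_RE.findall(doc.lower())   # lowercase, then tokenize
        counts.update(t for t in tokens if t not in stop_words)
    return counts.most_common(n)                 # n=None returns every word

sample = ["Boston job fair in Boston", "New job at the Boston office"]
print(top_words_stdlib(sample, 2))
```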

In [30]:
# Original and Missing Value (Group 1 and Group 2) DFs
og_group1 = df[df['label'] == 1]
og_group2 = df[df['label'] == 0]
mv_group1 = missing_vals_df[missing_vals_df['label'] == 1]
mv_group2 = missing_vals_df[missing_vals_df['label'] == 0]

In [31]:
# get top 10 words for each respective group
og_group1_tw = get_top_words(og_group1['text'], 10)
mv_group1_tw = get_top_words(mv_group1['text'], 10)
og_group2_tw = get_top_words(og_group2['text'], 10)
mv_group2_tw = get_top_words(mv_group2['text'], 10)

#### Originally Labeled, Group 1

In [32]:
og_group1_tw

[('rt', 1090),
 ('http', 660),
 ('prayforboston', 615),
 ('boston', 568),
 ('watertown', 541),
 ('tsarnaev', 311),
 ('attack', 259),
 ('dzhokhar', 231),
 ('suspect', 207),
 ('terrorist', 173)]

#### Newly Labeled, Group 1 

In [33]:
mv_group1_tw

[('rt', 2685),
 ('http', 1734),
 ('prayforboston', 1545),
 ('watertown', 1439),
 ('boston', 1431),
 ('tsarnaev', 684),
 ('attack', 633),
 ('dzhokhar', 500),
 ('suspect', 477),
 ('terror', 465)]

#### Originally Labeled, Group 2

In [34]:
og_group2_tw

[('boston', 2299),
 ('http', 1531),
 ('rt', 453),
 ('job', 410),
 ('ma', 319),
 ('jobs', 113),
 ('celtics', 100),
 ('time', 82),
 ('cambridge', 80),
 ('new', 79)]

#### Newly Labeled, Group 2

In [35]:
mv_group2_tw

[('boston', 8174),
 ('http', 5623),
 ('rt', 1471),
 ('job', 1467),
 ('ma', 1091),
 ('jobs', 376),
 ('new', 311),
 ('celtics', 294),
 ('news', 290),
 ('time', 248)]

Content across the original and newly labeled Tweets appears consistent for both Group 1 and Group 2!
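
One way to put a number on "consistent" is the Jaccard overlap between the two top-10 word sets for each group. The sketch below hard-codes the words from the four outputs above; the `jaccard` helper and the list names are illustrative, and this ignores rank order and frequency, so it is only a rough check.

```python
# Top-10 words copied from the outputs above.
og_g1_top = ["rt", "http", "prayforboston", "boston", "watertown",
             "tsarnaev", "attack", "dzhokhar", "suspect", "terrorist"]
mv_g1_top = ["rt", "http", "prayforboston", "watertown", "boston",
             "tsarnaev", "attack", "dzhokhar", "suspect", "terror"]
og_g2_top = ["boston", "http", "rt", "job", "ma",
             "jobs", "celtics", "time", "cambridge", "new"]
mv_g2_top = ["boston", "http", "rt", "job", "ma",
             "jobs", "new", "celtics", "news", "time"]

def jaccard(a, b):
    """Size of the intersection divided by size of the union of two word lists."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

g1_overlap = jaccard(og_g1_top, mv_g1_top)
g2_overlap = jaccard(og_g2_top, mv_g2_top)

# Cross-group overlap, for contrast: original Group 1 vs. original Group 2.
cross_overlap = jaccard(og_g1_top, og_g2_top)

print(f"Group 1: {g1_overlap:.2f}, Group 2: {g2_overlap:.2f}, cross: {cross_overlap:.2f}")
```

Within each group, nine of the ten top words are shared between the original and newly labeled data, while the two groups share only the generic tokens `rt`, `http`, and `boston` with each other.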

## 5. Understand content of posts between Group 1 and Group 2 (Task B)

Our client cares about the differences between Group 1 and Group 2. I think an easy way to paint the picture is by looking at top words used in Tweets.

In [36]:
# Combine original df (with labels) with newly-classified df
group1 = pd.concat([mv_group1, og_group1])
group2 = pd.concat([mv_group2, og_group2])

In [37]:
group1

Unnamed: 0,text,label
3,THANK YOU ?MT @Boston_Police: CAPTURED! The hu...,1.0
13,RT @RevEverett #interfaith love song: Trinity ...,1.0
16,Researchers urge brain autopsy of bombing susp...,1.0
18,RT @causeOlasaidso no matter how much goes dow...,1.0
19,Like this self-righteous piece of shit didn't ...,1.0
...,...,...
16190,RT @bommadog That awkward moment when Twitter ...,1.0
16192,RT @paulaebbenwbz: Source: talk of bringing in...,1.0
16199,Crowds cheering tactical teams coming out of F...,1.0
16212,RT @Karmaloopboston Updated photo of 19 year-o...,1.0


In [38]:
# Get top 20 words for both groups
group1_tw = get_top_words(group1['text'], 20)
group2_tw = get_top_words(group2['text'], 20)

### Group 1

In [40]:
for word, freq in group1_tw:
    print(word, freq)

rt 3775
http 2394
prayforboston 2160
boston 1999
watertown 1980
tsarnaev 995
attack 892
dzhokhar 731
suspect 684
terror 635
marathon 595
terrorist 581
police 491
bostonmarathon 378
people 364
bostonstrong 361
just 318
bombing 301
manhunt 291
news 268


### Group 2

In [41]:
for word, freq in group2_tw:
    print(word, freq)

boston 10473
http 7154
rt 1924
job 1877
ma 1410
jobs 489
celtics 394
new 390
news 361
time 330
cambridge 327
manager 325
https 277
bos 271
today 254
engineer 246
day 240
nba 239
city 236
looking 209


### FINDINGS
- **GROUP 1 (group of interest)** Tweets were related to the Boston Marathon bombings.
- **GROUP 2** Tweets were on subjects such as sports (celtics, nba) and job-hunt related topics (jobs, manager, engineer, looking).
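
To separate the words that actually distinguish the groups from the ones they share, a simple set difference over the two top-20 lists is enough. This sketch hard-codes the words printed above; the variable names are illustrative, and it treats any word outside the other group's top 20 as "distinctive", which is a rough heuristic rather than a statistical test.

```python
# Top-20 words copied from the Group 1 and Group 2 outputs above (frequency order).
group1_words = ["rt", "http", "prayforboston", "boston", "watertown", "tsarnaev",
                "attack", "dzhokhar", "suspect", "terror", "marathon", "terrorist",
                "police", "bostonmarathon", "people", "bostonstrong", "just",
                "bombing", "manhunt", "news"]
group2_words = ["boston", "http", "rt", "job", "ma", "jobs", "celtics", "new",
                "news", "time", "cambridge", "manager", "https", "bos", "today",
                "engineer", "day", "nba", "city", "looking"]

shared = set(group1_words) & set(group2_words)

# Keep the original frequency order while dropping shared words.
only_group1 = [w for w in group1_words if w not in shared]
only_group2 = [w for w in group2_words if w not in shared]

print("shared:", sorted(shared))
print("Group 1 only:", only_group1)
print("Group 2 only:", only_group2)
```

The shared words are generic Twitter and location tokens, while the group-specific lists carry the bombing-related and jobs/sports vocabulary that the findings describe.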

## 6. Summary of Project + Findings

- I initially discovered that about 75% of the rows (12,218) had no label. Fortunately, the remaining 25% (4,000 observations) were labeled, with **balanced** classes.
- I used the initial model (99.6% accuracy on the hold-out set) to label the remaining unlabeled rows.
- We find that GROUP 1 Tweets were related to the Boston Marathon bombings, while GROUP 2 Tweets were related to Sports and Job-Hunting.

### If I had more time ..

- It would have been fun to beautify/visualize (WordClouds) the word frequencies!
- Would have been cool to do Sentiment Analysis and perhaps expand timeframe to deeply understand the Groups.
    - Group 1 seems to be folks that frequently follow the news. I can probably confirm that if we got a larger scope of Tweets (across a larger timeframe). Perhaps we can advertise News Apps or offer discounted newspaper subscriptions to this audience.
    - Group 2 can be seen as sports junkies, but we probably want a larger timeframe just to confirm if these are NBA-only fans or if they bleed Boston sports and Tweet about everything sports (MLB, NFL, NHL). If so, sports cable packages can be a targeted ad. 
    - Group 2 also includes job-hunters. It would be interesting to see whether this can be isolated to a smaller population: "Cambridge" appears as a keyword, so some of these Twitter users may be students. Plenty of opportunity to share this data with headhunters or student aid/resume writer professionals.