# Analysis of Relationships and Topics among Tweets and Facebook Posts Associated with specific banks

## Questions and Deliverables

Q1. What financial topics* do consumers discuss on social media and what caused the consumers to post about this topic? 
   
>   Deliverable A - Describe your Approach and Methodology. Include a visual representation of your analytic process flow. 
   
>   Deliverable B - Discuss the data and its relationship to social conversation drivers. 

>   Deliverable C - Document your code and reference the analytic process flow-diagram from deliverable A. 


Q2. Are the topics and “substance” consistent across the industry or are they isolated to individual banks? 

>   Deliverable D - Create a list of topics and substance you found 

>   Deliverable E - Create a narrative of insights supported by the quantitative results (should include graphs or charts) 

#### Metadata:
Record Count: 220377

Media Type: Facebook & Twitter

Timeframe: Twitter data (8/2015) & Facebook data (8/2014 - 8/2015)

Scope: Social Media data with query of 4 banks

#### Scubbed Data:
4 Banks: BankA, BankB, BankC, BankD

ADDRESS: All scrubbed addresses are replaced by uppercase ADDRESS. Any occurrence of a lowercase "address" is part of the text and is not a scrubbed replacement.

Name: All names have been replaced with the lowercase word "Name"
Internet links

INTERNET: All scrubbed INTERNET references are replaced by uppercase INTERNET. Any occurrence of a lowercase "internet" is part of the text and is not a scrubbed replacement.

twit_hndl: All actual twitter handles "@" have been replaced with the lowercase abbreviation "twit_hndl". All Bank twitter handles have been replaced with the lowercase abbreviation followed by the respectively Bank "twit_hndl_BankA" , "twit_hndl_BankB"

PHONE: All scrubbed phone numbers are replaced by uppercase PHONE. Any occurrence of a lowercase "phone" is part of the text and is not a scrubbed replacement.

In [1]:
import numpy as np
import pandas as pd

In [2]:
df = pd.read_csv('2015+Wells+Fargo+Campus+Analytic+Challenge+Dataset.txt', sep='|')
df.head(5)

Unnamed: 0,AutoID,Date,Year,Month,MediaType,FullText
0,1,8/26/2015,2015,8,twitter,3 ways the internet of things will change Bank...
1,2,8/5/2015,2015,8,twitter,BankB BankB Name downgrades apple stock to neu...
2,3,8/12/2015,2015,8,twitter,BankB returns to profit on INTERNET/! board2? ...
3,4,8/5/2015,2015,8,twitter,BankB tells advisers to exit paulson hedge fun...
4,5,8/12/2015,2015,8,twitter,BankC may plead guilty over foreign exchange p...


In [3]:
import re

# preprocess the scrubbed strings
p0 = re.compile(r'ADDRESS|Name|INTERNET|twit_hndl_?|PHONE')
pretext = df['FullText'].map(lambda x: p0.sub('', x))

## Find the most popular tags

In [4]:
text0 = pretext.map(lambda t: t.lower())
rawtags1 = text0.map(lambda t: re.findall('\#\s\w+', t))
rawtags1 = rawtags1.map(lambda t: [w for w in t if w is not None])
rawtags2 = []
for t in rawtags1:
    if len(t) > 0:
        for _ in t:
            rawtags2.append(_)
rawtags3 = list(set(rawtags2))
rawtags4 = dict((a, rawtags2.count(a)) for a in rawtags3)
hashtags = sorted(rawtags4.iteritems(), key=lambda (k, v): -v)

In [5]:
hashtags[:30]

[('# bankc', 7003),
 ('# contest', 4685),
 ('# getcollegeready', 4626),
 ('# bankb', 3782),
 ('# banka', 3721),
 ('# bankd', 3314),
 ('# finance', 2680),
 ('# money', 2082),
 ('# goldmansachs', 1990),
 ('# wallstreet', 1987),
 ('# banksters', 1948),
 ('# economics', 1924),
 ('# hsbc', 1922),
 ('# usbank', 1922),
 ('# morganstanley', 1920),
 ('# federalreserve', 1915),
 ('# classwarfare', 1912),
 ('# financialterrorists', 1910),
 ('# stocks', 513),
 ('# business', 488),
 ('# news', 469),
 ('# c', 390),
 ('# realestate', 367),
 ('# stock', 359),
 ('# investment', 341),
 ('# banking', 339),
 ('# forex', 334),
 ('# share', 332),
 ('# acn', 329),
 ('# smallbiz', 301)]

Going back to the tweets, the hastags "contest" and "getcollegeready" always come together, and this campaign certainly draw much attention as the bank would expect. So this campaign is a highlight among the others. Other than that, my first impression is that when people talk about banks, they care about keeping their money safe and their investments increasingly growing up (by talking about ecnomy, stocks, federa reserve, forex, small biz), and there seems be an emotion about the banks as the opposite side of normal ones. I would have expected many hashtags on the services or other campaigns, but we need further tagging process to figure out how these information is related.

In [6]:
# do some viz about dist of hastags among different banks

## Prepare the Data Grouped By Banks

In [7]:
text = pretext.map(lambda x: x.lower())
text = text.map(lambda x: re.split('\W+|_+', x, flags=re.IGNORECASE))
text.head(5)

0    [3, ways, the, internet, of, things, will, cha...
1    [bankb, bankb, downgrades, apple, stock, to, n...
2    [bankb, returns, to, profit, on, board2, t, 95...
3    [bankb, tells, advisers, to, exit, paulson, he...
4    [bankc, may, plead, guilty, over, foreign, exc...
Name: FullText, dtype: object

In [8]:
indA = ['banka' in t for t in text]
indB = ['bankb' in t for t in text]
indC = ['bankc' in t for t in text]
indD = ['bankd' in t for t in text]

In [9]:
# pre-complie patterns to speed up
p1 = re.compile(r"(RT|via)((?:\b\W*@\w+)+)")  # remove retweet or via mark
p2 = re.compile(r"@\w+")  # remove at mark
p3 = re.compile(r"^[0-9]+$")  # remove pure numbers
p4 = re.compile(r"http\w+")  # remove http address
p5 = re.compile(r"^\s+|\s+$")  # remove space
p6 = re.compile(r"^\w*(\w)(\1){2,}\w*&")  # remove repetitive letters
p7 = re.compile(r"^\w{2}$")  # remove words with only two letters
text = text.map(lambda x: [p1.sub("", t) for t in x])
text = text.map(lambda x: [p2.sub("", t) for t in x])
text = text.map(lambda x: [p3.sub("", t) for t in x])
text = text.map(lambda x: [p4.sub("", t) for t in x])
text = text.map(lambda x: [p5.sub("", t) for t in x])
text = text.map(lambda x: [p6.sub("", t) for t in x])
text = text.map(lambda x: [p7.sub("", t) for t in x])
text.head(2)

0    [, ways, the, internet, , things, will, change...
1    [bankb, bankb, downgrades, apple, stock, , neu...
Name: FullText, dtype: object

In [10]:
stopwords = pd.read_csv('stopwords.txt')
outwords = ['bank','twit','hndl','lol','hey','make','made','name','don','bit','uhijre','ret','bankac','resp','ers','today','ift','dlvr','plc','goo','man','banke','bankds']

def rmStopWord(wlist):
    return [w for w in wlist if not (w == '' or w in stopwords.values or w in outwords)]     

posttext = text.map(lambda x: rmStopWord(x))
posttext.head(2)

0    [ways, internet, things, change, bankb, bankc,...
1    [bankb, bankb, downgrades, apple, stock, neutr...
Name: FullText, dtype: object

## Learn From TF-IDF Matrix

In [29]:
def prepareLines(wlist):
    wlist = [w for w in wlist if not w in ['banka', 'bankb', 'bankc', 'bankd']]
    return ' '.join(wlist)

lines = posttext.map(lambda x: prepareLines(x))

In [30]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidfvec = TfidfVectorizer(min_df=1)
Xtfidf = tfidfvec.fit_transform(lines)

In [31]:
idf = dict(zip(tfidfvec.get_feature_names(), tfidfvec.idf_))
top10 = sorted(idf.iteritems(), key=lambda x: -x[1])[:10]
top10                                                     

[(u'fawk', 12.609952352206955),
 (u'zbnlwm1sov5vz', 12.609952352206955),
 (u'bestcustomerservice', 12.609952352206955),
 (u'wednesd', 12.609952352206955),
 (u'33gdrk', 12.609952352206955),
 (u'kurringaibankd2015', 12.609952352206955),
 (u'wannaaaa', 12.609952352206955),
 (u'1j9qzei', 12.609952352206955),
 (u'aresearch', 12.609952352206955),
 (u'mumfordandsons', 12.609952352206955)]

Still lots of noises there since TFIDF tend to over-estimate the importance of very rare words. We would just keep it and move on to the next method.

## K-Means Clustering

As k-means is optimizing a non-convex objective function, it will likely end up in a local optimum. Trying with different parameters is neccessary.

In [34]:
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

true_k = 6
km_model = KMeans(n_clusters=true_k, init='k-means++', max_iter=100, n_init=1)
km_model.fit(Xtfidf)

KMeans(copy_x=True, init='k-means++', max_iter=100, n_clusters=6, n_init=1,
    n_jobs=1, precompute_distances='auto', random_state=None, tol=0.0001,
    verbose=0)

In [35]:
print("Top terms per cluster:")
order_centroids = km_model.cluster_centers_.argsort()[:, ::-1]
terms = tfidfvec.get_feature_names()
for i in range(true_k):
    print "Cluster {}:".format(i)
    kws = []
    for ind in order_centroids[i, :10]:
        kws.append(terms[ind])
    print "{}".format(', '.join(kws)+'\n')

Top terms per cluster:
Cluster 0:
contest, getcollegeready, vote, street, money, main, business, mission, card, photo

Cluster 1:
pay, mortgage, home, loan, million, settlement, loans, billion, student, settle

Cluster 2:
assistance, called, additional, trouble, provided, happy, future, call, needed, information

Cluster 3:
account, assist, happened, numbers, money, open, checking, dont, tweet, card

Cluster 4:
management, asset, stockport, wealth, advisers, managers, financial, cheshire, ftse, independent

Cluster 5:
msg, dir, zip, phone, follow, call, account, number, requested, concerns



### Take a look at the clusters

As a quick shot, K-Means works better than I have expected. The clusters are clearly divided and the term with highest frequency is exactly the summary of the cluster. As indicated ahead, the clusters are:

1. contest - tweets about the 'get college ready' contest.
2. pay - tweets about paying mortgages, loans including student loans.
3. assistance - tweets about getting assitance from the bank by calls, getting information and get all of trouble.
4. management - tweets about asset management with financial advisors 
5. msg - tweets about providing informations to the others

Amazing right?!

To determine wether there are differences of topics among different banks, we could divide the dataset and apply TFIDF and K-Means Method respectively.

## Generate LDA Topic Models

In [16]:
from sklearn.feature_extraction.text import CountVectorizer 
import gensim

def generate_ldamodel(indArray ,min_df=1, num_topics=10):
    countvec = CountVectorizer(min_df=min_df)
    X = countvec.fit_transform(indArray)
    corpus = countvec.get_feature_names()
    id2words = dict((v, k) for k, v in countvec.vocabulary_.iteritems())
    corpus_gensim = gensim.matutils.Sparse2Corpus(X, documents_columns=False)
    ldamodel = gensim.models.ldamodel.LdaModel(corpus_gensim, id2word=id2words, num_topics=10, update_every=1, chunksize=1000, passes=1)
    return ldamodel

modelA = generate_ldamodel(lines[indA])
modelB = generate_ldamodel(lines[indB])
modelC = generate_ldamodel(lines[indC])
modelD = generate_ldamodel(lines[indD])


In [17]:
modelA.print_topics()

[u'0.061*money + 0.059*bankb + 0.047*banka + 0.039*photo + 0.031*shared + 0.031*bankc + 0.021*goldmansachs + 0.021*morganstanley + 0.021*wallstreet + 0.021*financialterrorists',
 u'0.106*banka + 0.053*account + 0.037*card + 0.025*money + 0.022*credit + 0.022*dont + 0.015*usbank + 0.015*pay + 0.011*business + 0.011*loan',
 u'0.115*banka + 0.025*home + 0.022*business + 0.017*mortgage + 0.016*working + 0.014*wow + 0.009*building + 0.009*market + 0.009*loans + 0.009*stock',
 u'0.127*banka + 0.026*customer + 0.024*service + 0.013*phone + 0.013*work + 0.013*time + 0.013*rebanke + 0.012*good + 0.010*company + 0.008*put',
 u'0.111*banka + 0.024*call + 0.015*job + 0.014*night + 0.014*give + 0.014*great + 0.012*fund + 0.012*whats + 0.011*team + 0.011*share',
 u'0.099*banka + 0.027*bankd + 0.026*finance + 0.024*account + 0.019*check + 0.019*federalreserve + 0.017*cash + 0.014*guys + 0.012*fees + 0.011*atm',
 u'0.108*banka + 0.067*center + 0.015*buy + 0.014*tickets + 0.008*philadelphia + 0.007*fri

In [18]:
modelB.print_topics()

[u'0.113*bankb + 0.022*customer + 0.021*service + 0.020*chicago + 0.016*marathon + 0.013*put + 0.010*sachs + 0.009*hate + 0.008*worst + 0.008*business',
 u'0.160*bankb + 0.041*stadium + 0.037*economics + 0.017*buy + 0.015*charlotte + 0.013*rating + 0.011*stock + 0.010*dollars + 0.009*apple + 0.009*things',
 u'0.068*bankb + 0.036*wallstreet + 0.018*check + 0.017*great + 0.015*job + 0.014*days + 0.013*banks + 0.012*week + 0.012*game + 0.012*credit',
 u'0.078*bankc + 0.053*banka + 0.045*bankb + 0.044*bankd + 0.041*money + 0.035*finance + 0.032*shared + 0.030*hsbc + 0.030*goldmansachs + 0.030*usbank',
 u'0.117*bankb + 0.070*money + 0.038*account + 0.031*bankd + 0.028*banksters + 0.025*banka + 0.014*dont + 0.012*shit + 0.011*fuck + 0.011*ass',
 u'0.079*bankb + 0.061*card + 0.031*account + 0.025*call + 0.017*debit + 0.015*number + 0.014*called + 0.012*pay + 0.012*check + 0.012*free',
 u'0.109*bankb + 0.023*banka + 0.014*atm + 0.012*account + 0.011*time + 0.011*people + 0.009*love + 0.009*guy

In [19]:
modelC.print_topics()

[u'0.148*shared + 0.101*bankc + 0.012*fund + 0.010*real + 0.009*settlement + 0.008*analyst + 0.007*government + 0.007*million + 0.007*mortgage + 0.007*court',
 u'0.090*bankc + 0.066*money + 0.063*hsbc + 0.060*economics + 0.012*street + 0.011*banking + 0.009*cut + 0.008*hold + 0.007*business + 0.007*banks',
 u'0.091*bankc + 0.022*service + 0.021*customer + 0.012*year + 0.012*team + 0.010*tower + 0.010*plaza + 0.008*trading + 0.007*data + 0.007*added',
 u'0.096*bankc + 0.025*financial + 0.015*wow + 0.014*good + 0.013*job + 0.011*loan + 0.009*student + 0.008*services + 0.008*home + 0.008*business',
 u'0.110*bankc + 0.040*card + 0.039*credit + 0.013*big + 0.011*account + 0.010*cards + 0.009*rebanke + 0.008*open + 0.007*dont + 0.007*worst',
 u'0.145*bankc + 0.125*photo + 0.030*theodwridis + 0.030*giannis + 0.025*money + 0.022*world + 0.019*oil + 0.018*price + 0.017*hall + 0.015*group',
 u'0.131*bankc + 0.062*banka + 0.059*banksters + 0.059*rating + 0.056*finance + 0.035*buy + 0.023*neutral 

In [20]:
modelD.print_topics()

[u'0.094*bankd + 0.016*price + 0.016*target + 0.008*real + 0.008*state + 0.007*list + 0.006*earns + 0.006*bad + 0.006*year + 0.005*profit',
 u'0.104*bankd + 0.029*card + 0.017*credit + 0.013*time + 0.009*fraud + 0.008*world + 0.008*debit + 0.008*center + 0.008*cards + 0.007*bankb',
 u'0.099*bankd + 0.029*hsbc + 0.029*goldmansachs + 0.017*money + 0.013*account + 0.012*manager + 0.012*people + 0.011*check + 0.011*community + 0.010*trade',
 u'0.105*street + 0.102*vote + 0.102*main + 0.101*mission + 0.060*business + 0.054*small + 0.051*program + 0.051*apply + 0.051*full + 0.051*learn',
 u'0.147*bankd + 0.053*rating + 0.037*photo + 0.030*overweight + 0.020*neutral + 0.018*group + 0.015*reiterated + 0.013*stock + 0.011*yum + 0.009*energy',
 u'0.105*bankd + 0.046*asset + 0.042*management + 0.041*stockport + 0.015*transfer + 0.008*support + 0.008*news + 0.007*investment + 0.006*talk + 0.006*find',
 u'0.074*bankd + 0.047*bankb + 0.041*money + 0.036*banka + 0.034*finance + 0.028*banksters + 0.02

## Generate Similars words from Word2Vec 

In [21]:
model2A = gensim.models.Word2Vec(posttext[indA], min_count=5, size=200, window=10, sample=1e-3)
model2B = gensim.models.Word2Vec(posttext[indB], min_count=5, size=200, window=10, sample=1e-3)
model2C = gensim.models.Word2Vec(posttext[indC], min_count=5, size=200, window=10, sample=1e-3)
model2D = gensim.models.Word2Vec(posttext[indD], min_count=5, size=200, window=10, sample=1e-3)

In [22]:
pos_words = ['getcollegeready', 'finance', 'wallstreet', 'economics', 'federalreserve']
neg_words = ['classwarfare', 'financialterrorists', 'finance', 'wallstreet', 'economics', 'federalreserve', 'banksters']

synonymA = model2A.most_similar(positive=['banka'])
synonymB = model2B.most_similar(positive=['bankb'])
synonymC = model2C.most_similar(positive=['bankc'])
synonymD = model2D.most_similar(positive=['bankd'])

In [23]:
synonymA

[('rips', 0.41391533613204956),
 ('legbanke', 0.4094012677669525),
 ('rejects', 0.4015510082244873),
 ('voltas', 0.3973686993122101),
 ('controls', 0.3954523801803589),
 ('diverse', 0.39538949728012085),
 ('laundered', 0.39101722836494446),
 ('racked', 0.3908594846725464),
 ('priority', 0.3881450593471527),
 ('bankb', 0.386219322681427)]

In [24]:
synonymB

[('banka', 0.40592920780181885),
 ('flirting', 0.40380358695983887),
 ('time', 0.40342628955841064),
 ('deactivating', 0.3880406320095062),
 ('onna', 0.38779035210609436),
 ('money', 0.3877057433128357),
 ('momma', 0.38676372170448303),
 ('finest', 0.38422441482543945),
 ('bees', 0.3828417658805847),
 ('bday', 0.3821169137954712)]

In [25]:
synonymC

[('ally', 0.4044133424758911),
 ('joy', 0.4033159017562866),
 ('mobi', 0.3953450620174408),
 ('brad', 0.3903713822364807),
 ('tradehero', 0.3739902675151825),
 ('americans', 0.36382365226745605),
 ('democrat', 0.35904332995414734),
 ('mac', 0.35303962230682373),
 ('bust', 0.35217738151550293),
 ('plutonomy', 0.3496810793876648)]

In [26]:
synonymD

[('starwood', 0.3895378112792969),
 ('format', 0.3509492874145508),
 ('a', 0.3482472598552704),
 ('slices', 0.34181737899780273),
 ('crush', 0.33760976791381836),
 ('viking', 0.3316380977630615),
 ('lifes', 0.33112823963165283),
 ('vietnamese', 0.3307943344116211),
 ('refunding', 0.32516494393348694),
 ('somethin', 0.32326266169548035)]

For the record, the Word2Vec is not giving out explainable information on the topic, and it is very hard to get the best result since the output is very sensitive to the parameters. But it could help as a good supplement. Here are several interesting findings:

1. banka and bankb are more related than other two banks since in banka's synonyms appear bankb, and vice versa.
2. In banka's synonyms, legbanke aprears approximately 80 times. Let's go back to the tweets and find the original sentence: "they keep updating me on what i no longer legbanke owe them. Name. we see u BankA thugs!" Seems that the word should be legally, and the engineers subsititute the 'ally' with 'banke'. Also, lots of negative words for banka.
3. In bankc's synonyms, two words catch my eyes: democrat and tradehero. It seems that many people use tradehero to buy shares of bankc and bankd.
4. In bankd's synonyms, 'viking' and 'vikings stadium' is mentioned a lot, gotcha U.S.Bank.

