# Analysis of Relationships and Topics among Tweets and Facebook Posts Associated with Specific Banks

## Questions and Deliverables

Q1. What financial topics* do consumers discuss on social media and what caused the consumers to post about this topic? 
   
>   Deliverable A - Describe your Approach and Methodology. Include a visual representation of your analytic process flow. 
   
>   Deliverable B - Discuss the data and its relationship to social conversation drivers. 

>   Deliverable C - Document your code and reference the analytic process flow-diagram from deliverable A. 


Q2. Are the topics and “substance” consistent across the industry or are they isolated to individual banks? 

>   Deliverable D - Create a list of topics and substance you found 

>   Deliverable E - Create a narrative of insights supported by the quantitative results (should include graphs or charts) 

#### Metadata:
Record Count: 220377

Media Type: Facebook & Twitter

Timeframe: Twitter data (8/2015) & Facebook data (8/2014 - 8/2015)

Scope: Social Media data with query of 4 banks

#### Scrubbed Data:
4 Banks: BankA, BankB, BankC, BankD

ADDRESS: All scrubbed addresses are replaced by uppercase ADDRESS. Any occurrence of a lowercase "address" is part of the text and is not a scrubbed replacement.

Name: All names have been replaced with the capitalized word "Name".

INTERNET: All scrubbed internet links are replaced by uppercase INTERNET. Any occurrence of a lowercase "internet" is part of the text and is not a scrubbed replacement.

twit_hndl: All Twitter handles (beginning with "@") have been replaced with the lowercase abbreviation "twit_hndl". Bank Twitter handles have been replaced with the abbreviation followed by the respective bank name, e.g. "twit_hndl_BankA", "twit_hndl_BankB".

PHONE: All scrubbed phone numbers are replaced by uppercase PHONE. Any occurrence of a lowercase "phone" is part of the text and is not a scrubbed replacement.

In [1]:
import numpy as np
import pandas as pd

In [2]:
df = pd.read_csv('2015+Wells+Fargo+Campus+Analytic+Challenge+Dataset.txt', sep='|')
df.head(5)

|   | AutoID | Date | Year | Month | MediaType | FullText |
|---|---|---|---|---|---|---|
| 0 | 1 | 8/26/2015 | 2015 | 8 | twitter | 3 ways the internet of things will change Bank... |
| 1 | 2 | 8/5/2015 | 2015 | 8 | twitter | BankB BankB Name downgrades apple stock to neu... |
| 2 | 3 | 8/12/2015 | 2015 | 8 | twitter | BankB returns to profit on INTERNET/! board2? ... |
| 3 | 4 | 8/5/2015 | 2015 | 8 | twitter | BankB tells advisers to exit paulson hedge fun... |
| 4 | 5 | 8/12/2015 | 2015 | 8 | twitter | BankC may plead guilty over foreign exchange p... |


In [3]:
import re

# preprocess the scrubbed strings
p0 = re.compile(r'ADDRESS|Name|INTERNET|twit_hndl_?|PHONE')
pretext = df['FullText'].map(lambda x: p0.sub('', x))
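
As a quick check of the substitution above, the sketch below applies the same pattern to two made-up strings that follow the scrubbing conventions (the sample text is hypothetical, not taken from the dataset). The optional trailing underscore in `twit_hndl_?` leaves the bank name behind, so a scrubbed bank handle still counts as a mention of that bank.

```python
import re

# same pattern as p0 in the cell above
p0 = re.compile(r'ADDRESS|Name|INTERNET|twit_hndl_?|PHONE')

# hypothetical samples written to follow the scrubbing conventions
samples = [
    'twit_hndl_BankA please call me at PHONE',
    'thanks Name for the help, see INTERNET',
]

cleaned = [p0.sub('', s) for s in samples]
for s in cleaned:
    print(repr(s))
```

The placeholders are deleted rather than replaced by a space, so the surrounding whitespace is left as it was; the later split on non-word characters absorbs it.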

## Find the most popular tags

In [4]:
from collections import Counter

text0 = pretext.map(lambda t: t.lower())
# scrubbed hashtags appear as "# tag": '#', one whitespace character, then the tag
rawtags = text0.map(lambda t: re.findall(r'#\s\w+', t))
# count every hashtag occurrence across all posts, most frequent first
hashtags = Counter(tag for tags in rawtags for tag in tags).most_common()

In [5]:
hashtags[:30]

[('# bankc', 7003),
 ('# contest', 4685),
 ('# getcollegeready', 4626),
 ('# bankb', 3782),
 ('# banka', 3721),
 ('# bankd', 3314),
 ('# finance', 2680),
 ('# money', 2082),
 ('# goldmansachs', 1990),
 ('# wallstreet', 1987),
 ('# banksters', 1948),
 ('# economics', 1924),
 ('# hsbc', 1922),
 ('# usbank', 1922),
 ('# morganstanley', 1920),
 ('# federalreserve', 1915),
 ('# classwarfare', 1912),
 ('# financialterrorists', 1910),
 ('# stocks', 513),
 ('# business', 488),
 ('# news', 469),
 ('# c', 390),
 ('# realestate', 367),
 ('# stock', 359),
 ('# investment', 341),
 ('# banking', 339),
 ('# forex', 334),
 ('# share', 332),
 ('# acn', 329),
 ('# smallbiz', 301)]

Looking back at the tweets, the hashtags "contest" and "getcollegeready" nearly always appear together (4685 and 4626 occurrences), and the campaign behind them clearly drew as much attention as the bank would have hoped, which makes it the standout campaign in the data. Beyond that, my first impression is that when people talk about banks they care about keeping their money safe and growing their investments (economy, stocks, Federal Reserve, forex, small business). There also seems to be hostility toward the banks, which are cast as standing against ordinary people (classwarfare, financialterrorists, banksters). I expected more hashtags about services or other campaigns; further tagging is needed to work out how this information is related.

In [6]:
# TODO: visualize the distribution of hashtags among the different banks
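
The cell above is left as a to-do in the notebook. One possible starting point, sketched here under assumptions, is to count hashtags separately for each bank mentioned in a post; the resulting counters could then be fed to any plotting library. The function name `hashtags_by_bank` and the sample posts are hypothetical, and a post that mentions several banks contributes to each of them.

```python
import re
from collections import Counter

BANKS = ['banka', 'bankb', 'bankc', 'bankd']
TAG = re.compile(r'#\s(\w+)')  # scrubbed hashtags look like "# tag"

def hashtags_by_bank(posts):
    """Count hashtags separately for every bank mentioned in a post."""
    counts = dict((bank, Counter()) for bank in BANKS)
    for post in posts:
        post = post.lower()
        tokens = set(re.split(r'\W+|_+', post))
        tags = TAG.findall(post)
        for bank in BANKS:
            if bank in tokens:
                counts[bank].update(tags)
    return counts

# hypothetical posts, for illustration only
sample_posts = [
    'BankA # contest # getcollegeready enter now',
    'BankB and BankA fees # finance',
    'BankC # finance # banksters',
]

by_bank = hashtags_by_bank(sample_posts)
for bank in BANKS:
    print(bank, by_bank[bank].most_common(3))
```

Comparing the top hashtags per bank side by side is one way to see whether a tag is specific to a single bank or shared across the industry.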

## Prepare the Data Grouped By Banks

In [7]:
text = pretext.map(lambda x: x.lower())
text = text.map(lambda x: re.split(r'\W+|_+', x))  # split on non-word characters and underscores
text.head(5)

0    [3, ways, the, internet, of, things, will, cha...
1    [bankb, bankb, downgrades, apple, stock, to, n...
2    [bankb, returns, to, profit, on, board2, t, 95...
3    [bankb, tells, advisers, to, exit, paulson, he...
4    [bankc, may, plead, guilty, over, foreign, exc...
Name: FullText, dtype: object

In [8]:
indA = ['banka' in t for t in text]
indB = ['bankb' in t for t in text]
indC = ['bankc' in t for t in text]
indD = ['bankd' in t for t in text]

In [9]:
# pre-compile patterns to speed up the per-token substitutions
p1 = re.compile(r"(RT|via)((?:\b\W*@\w+)+)")  # remove retweet or via marks
p2 = re.compile(r"@\w+")  # remove @mentions
p3 = re.compile(r"^[0-9]+$")  # remove purely numeric tokens
p4 = re.compile(r"http\w+")  # remove http addresses
p5 = re.compile(r"^\s+|\s+$")  # strip leading and trailing whitespace
# intended to remove tokens containing a character repeated 3+ times; NOTE: the trailing "&"
# looks like a typo for "$", and as written the pattern cannot match tokens split on \W
p6 = re.compile(r"^\w*(\w)(\1){2,}\w*&")
p7 = re.compile(r"^\w{2}$")  # remove two-character tokens
text = text.map(lambda x: [p1.sub("", t) for t in x])
text = text.map(lambda x: [p2.sub("", t) for t in x])
text = text.map(lambda x: [p3.sub("", t) for t in x])
text = text.map(lambda x: [p4.sub("", t) for t in x])
text = text.map(lambda x: [p5.sub("", t) for t in x])
text = text.map(lambda x: [p6.sub("", t) for t in x])
text = text.map(lambda x: [p7.sub("", t) for t in x])
text.head(2)

0    [, ways, the, internet, , things, will, change...
1    [bankb, bankb, downgrades, apple, stock, , neu...
Name: FullText, dtype: object
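
Pattern `p6` in the cell above ends in a literal `&`, which cannot occur in tokens that were split on non-word characters. The sketch below compares it with a variant anchored by `$`, which is only an assumption about the intended behavior. This would be consistent with tokens such as `wannaaaa` surviving into the TF-IDF vocabulary later in the notebook. The sample tokens are chosen for illustration.

```python
import re

as_written = re.compile(r"^\w*(\w)(\1){2,}\w*&")   # pattern p6 as it appears in the notebook
with_dollar = re.compile(r"^\w*(\w)(\1){2,}\w*$")  # assumed intent: anchor at the end of the token

tokens = ['wannaaaa', 'sooo', 'account', 'ahahaha']

kept_as_written = [as_written.sub('', t) for t in tokens]
kept_with_dollar = [with_dollar.sub('', t) for t in tokens]

print(kept_as_written)
print(kept_with_dollar)
```

Only tokens with the same character three or more times in a row are blanked by the anchored variant; `account` (a double letter) and `ahahaha` (an alternating pattern) pass through either way.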

In [10]:
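stopwords = pd.read_csv('stopwords.txt')
# sets give constant-time membership tests, which matters across ~220k posts
stopset = set(stopwords.values.ravel())
outwords = set(['banka','bankb','bankc','bankd','bank','twit','hndl','lol','hey','make','made','name','don','bit','uhijre','ret','bankac','resp','ers','today','ift','dlvr','plc','goo','man','banke','bankds'])

def rmStopWord(wlist):
    # drop empty tokens, general stopwords, and corpus-specific noise words
    return [w for w in wlist if w and w not in stopset and w not in outwords]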

posttext = text.map(lambda x: rmStopWord(x))
posttext.head(2)

0    [ways, internet, things, change, 1u5ad88, inte...
1    [downgrades, apple, stock, neutral, anticipate...
Name: FullText, dtype: object

## K-Means Clustering

## Learn From TF-IDF Matrix

In [12]:
lines = posttext.map(lambda x: ' '.join(x))

In [13]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidfvec = TfidfVectorizer(min_df=1)
Xtfidf = tfidfvec.fit_transform(lines)

In [14]:
idf = dict(zip(tfidfvec.get_feature_names(), tfidfvec.idf_))
# highest IDF means rarest term, so these are the 10 least common tokens, not the most important ones
top10 = sorted(idf.items(), key=lambda x: -x[1])[:10]
top10

[(u'fawk', 12.609952352206955),
 (u'zbnlwm1sov5vz', 12.609952352206955),
 (u'bestcustomerservice', 12.609952352206955),
 (u'wednesd', 12.609952352206955),
 (u'33gdrk', 12.609952352206955),
 (u'kurringaibankd2015', 12.609952352206955),
 (u'wannaaaa', 12.609952352206955),
 (u'1j9qzei', 12.609952352206955),
 (u'aresearch', 12.609952352206955),
 (u'mumfordandsons', 12.609952352206955)]
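
All ten values above are identical, which suggests that each of these tokens appears in exactly one post. Assuming scikit-learn's default smoothed IDF, idf(t) = ln((1 + n) / (1 + df(t))) + 1, a term with document frequency 1 among the 220377 records scores about 12.61, in line with the output. The helper `smoothed_idf` below is a hypothetical name used to check that arithmetic; it is not part of the notebook.

```python
import math

def smoothed_idf(n_docs, doc_freq):
    """IDF with add-one smoothing: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1.0 + n_docs) / (1.0 + doc_freq)) + 1.0

n_docs = 220377  # record count from the metadata section

# IDF falls as a term appears in more documents
for doc_freq in (1, 10, 1000):
    print(doc_freq, round(smoothed_idf(n_docs, doc_freq), 4))
```

Sorting by the largest IDF therefore surfaces one-off tokens such as link fragments and misspellings; raising `min_df` in the vectorizer is one way to keep them out of the vocabulary.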

## Generate LDA Topic Models

In [15]:
from sklearn.feature_extraction.text import CountVectorizer 
import gensim

def generate_ldamodel(docs, min_df=1, num_topics=10):
    # bag-of-words counts: rows are documents, columns are vocabulary terms
    countvec = CountVectorizer(min_df=min_df)
    X = countvec.fit_transform(docs)
    # gensim expects a mapping from column index to word
    id2words = dict((v, k) for k, v in countvec.vocabulary_.items())
    corpus_gensim = gensim.matutils.Sparse2Corpus(X, documents_columns=False)
    # pass num_topics through instead of hard-coding 10
    ldamodel = gensim.models.ldamodel.LdaModel(corpus_gensim, id2word=id2words, num_topics=num_topics, update_every=1, chunksize=1000, passes=1)
    return ldamodel

modelA = generate_ldamodel(lines[indA])
modelB = generate_ldamodel(lines[indB])
modelC = generate_ldamodel(lines[indC])
modelD = generate_ldamodel(lines[indD])


In [16]:
modelA.print_topics()

[u'0.027*day + 0.014*tomorrow + 0.013*free + 0.012*sucks + 0.011*email + 0.011*saturday + 0.011*class + 0.011*whats + 0.010*family + 0.010*big',
 u'0.019*guys + 0.018*shit + 0.017*call + 0.014*fees + 0.014*stop + 0.014*loans + 0.013*fucking + 0.012*building + 0.011*ass + 0.011*student',
 u'0.032*customer + 0.030*service + 0.018*dont + 0.018*open + 0.016*great + 0.015*home + 0.014*time + 0.014*mortgage + 0.013*happy + 0.013*night',
 u'0.076*account + 0.034*money + 0.020*card + 0.016*check + 0.015*phone + 0.013*time + 0.012*worst + 0.012*wow + 0.012*years + 0.012*work',
 u'0.017*team + 0.016*love + 0.016*atm + 0.013*put + 0.011*giving + 0.010*community + 0.010*paid + 0.010*coming + 0.009*day + 0.009*credit',
 u'0.074*center + 0.013*morning + 0.012*stock + 0.012*teller + 0.011*bad + 0.010*philadelphia + 0.009*world + 0.009*week + 0.009*called + 0.009*long',
 u'0.044*money + 0.022*hsbc + 0.022*goldmansachs + 0.022*economics + 0.022*federalreserve + 0.022*morganstanley + 0.022*financialterr

In [17]:
modelB.print_topics()

[u'0.042*photo + 0.041*account + 0.032*economics + 0.032*morganstanley + 0.022*money + 0.021*chicago + 0.017*marathon + 0.015*open + 0.013*team + 0.012*ass',
 u'0.036*hsbc + 0.023*home + 0.018*financial + 0.015*mortgage + 0.015*pay + 0.012*line + 0.011*times + 0.010*working + 0.010*banks + 0.010*dollars',
 u'0.013*send + 0.012*free + 0.010*find + 0.009*corporation + 0.009*union + 0.008*photos + 0.008*department + 0.007*american + 0.007*tryna + 0.007*bankbs',
 u'0.068*card + 0.028*call + 0.021*credit + 0.019*debit + 0.017*number + 0.016*phone + 0.016*called + 0.013*told + 0.010*account + 0.009*getcollegeready',
 u'0.034*service + 0.033*customer + 0.016*account + 0.014*years + 0.012*guys + 0.011*morning + 0.011*night + 0.010*fuck + 0.009*minutes + 0.008*inbox',
 u'0.040*account + 0.023*money + 0.021*work + 0.020*day + 0.019*dont + 0.018*good + 0.015*time + 0.014*rebanke + 0.011*customers + 0.011*big',
 u'0.020*buy + 0.015*rating + 0.013*stock + 0.012*building + 0.011*fucking + 0.011*shar

In [18]:
modelC.print_topics()

[u'0.064*rating + 0.036*credit + 0.035*buy + 0.035*card + 0.023*neutral + 0.017*time + 0.016*reiterated + 0.013*sell + 0.010*group + 0.010*running',
 u'0.045*goldman + 0.044*sachs + 0.015*banking + 0.011*free + 0.010*day + 0.010*gold + 0.009*banks + 0.008*cards + 0.007*card + 0.007*mortgage',
 u'0.167*shared + 0.016*target + 0.011*buy + 0.010*raised + 0.010*call + 0.008*number + 0.008*home + 0.008*list + 0.008*card + 0.007*account',
 u'0.012*market + 0.012*president + 0.012*job + 0.011*deal + 0.011*good + 0.010*fraud + 0.010*research + 0.010*loan + 0.008*analyst + 0.008*amazon',
 u'0.022*oil + 0.020*price + 0.015*report + 0.014*world + 0.014*trillion + 0.014*change + 0.013*hall + 0.011*global + 0.011*fund + 0.011*cost',
 u'0.067*finance + 0.059*wallstreet + 0.057*usbank + 0.054*financialterrorists + 0.054*classwarfare + 0.053*federalreserve + 0.053*giannis + 0.053*theodwridis + 0.051*morganstanley + 0.039*money',
 u'0.024*service + 0.022*customer + 0.014*work + 0.013*year + 0.013*great

In [19]:
modelD.print_topics()

[u'0.061*rating + 0.023*neutral + 0.021*group + 0.017*reiterated + 0.016*young + 0.013*community + 0.013*asset + 0.011*world + 0.010*yum + 0.010*years',
 u'0.019*goldman + 0.018*company + 0.018*sachs + 0.016*data + 0.012*banks + 0.012*love + 0.011*fraud + 0.011*fund + 0.010*banking + 0.009*security',
 u'0.037*card + 0.021*credit + 0.015*day + 0.013*stock + 0.011*trade + 0.011*big + 0.010*debit + 0.010*center + 0.008*services + 0.008*talk',
 u'0.067*management + 0.064*financial + 0.053*wealth + 0.049*advisers + 0.036*usbank + 0.009*market + 0.009*year + 0.007*investment + 0.007*oil + 0.007*cheshire',
 u'0.048*stockport + 0.043*managers + 0.038*finance + 0.037*shared + 0.032*overweight + 0.030*economics + 0.029*financialterrorists + 0.028*wallstreet + 0.028*classwarfare + 0.028*money',
 u'0.059*asset + 0.043*banksters + 0.015*trading + 0.009*watch + 0.008*analyst + 0.008*set + 0.008*friends + 0.007*charged + 0.007*stocks + 0.007*run',
 u'0.058*account + 0.038*money + 0.031*photo + 0.015*
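
To put a number on how consistent topics are across banks (Q2), one option is to parse the printed topic strings and compare their word sets. The sketch below is an illustration rather than the notebook's method: it assumes the `weight*word + ...` format shown in the outputs above, uses the first topic printed for BankC and for BankD as input, and measures overlap with the Jaccard index. The names `parse_topic` and `jaccard` are hypothetical.

```python
def parse_topic(topic_str):
    """Turn '0.064*rating + 0.036*credit' into [('rating', 0.064), ('credit', 0.036)]."""
    pairs = []
    for term in topic_str.split(' + '):
        weight, word = term.split('*', 1)
        pairs.append((word, float(weight)))
    return pairs

def jaccard(topic_a, topic_b):
    """Share of words common to both topics, relative to all words in either."""
    words_a = set(w for w, _ in parse_topic(topic_a))
    words_b = set(w for w, _ in parse_topic(topic_b))
    return len(words_a & words_b) / float(len(words_a | words_b))

# first topic printed for modelC and for modelD in the outputs above
topic_c0 = ('0.064*rating + 0.036*credit + 0.035*buy + 0.035*card + 0.023*neutral + '
            '0.017*time + 0.016*reiterated + 0.013*sell + 0.010*group + 0.010*running')
topic_d0 = ('0.061*rating + 0.023*neutral + 0.021*group + 0.017*reiterated + 0.016*young + '
            '0.013*community + 0.013*asset + 0.011*world + 0.010*yum + 0.010*years')

print(jaccard(topic_c0, topic_d0))
```

These two topics share four of their sixteen distinct words (rating, neutral, group, reiterated), which points to an analyst-rating theme common to both banks. Computing the score for every pair of topics across the four models would give a rough map of which themes are industry-wide and which are bank-specific.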

## Generate Similar Words from Word2Vec

In [20]:
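# trained on `text` rather than `posttext`, because the bank names queried below are stripped from `posttext`;
# min_count=1 keeps every token, including one-off typos and link fragments
model2A = gensim.models.Word2Vec(text[indA], min_count=1, size=40)
model2B = gensim.models.Word2Vec(text[indB], min_count=1, size=40)
model2C = gensim.models.Word2Vec(text[indC], min_count=1, size=40)
model2D = gensim.models.Word2Vec(text[indD], min_count=1, size=40)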

In [21]:
pos_words = ['getcollegeready', 'finance', 'wallstreet', 'economics', 'federalreserve']
neg_words = ['classwarfare', 'financialterrorists', 'finance', 'wallstreet', 'economics', 'federalreserve', 'banksters']

synonymA = model2A.most_similar(positive=['banka'])
synonymB = model2B.most_similar(positive=['bankb'])
synonymC = model2C.most_similar(positive=['bankc'])
synonymD = model2D.most_similar(positive=['bankd'])

In [22]:
synonymA
# 1.bank teller at BankA just flirted with me and then followed up by giving me a blue raspberry dum-dum.
# # husbandmaterial# orhethinksimalittlegirl

[('husbandmaterial', 0.5419613122940063),
 ('shrieking', 0.5392065048217773),
 ('4iugnnqps', 0.47520220279693604),
 ('ludicrous', 0.47099435329437256),
 ('3y7uax', 0.46934011578559875),
 ('yourhomeyourway', 0.4658353924751282),
 ('batess', 0.46269187331199646),
 ('julia', 0.4623016119003296),
 ('bzy0fk', 0.4602036774158478),
 ('comercal', 0.45838963985443115)]

In [23]:
synonymB

[('palmers', 0.5909887552261353),
 ('achieva', 0.5908536314964294),
 ('problam', 0.5905898809432983),
 ('kidos', 0.5829063653945923),
 ('salesmen', 0.5633691549301147),
 ('trials', 0.5237843990325928),
 ('magical', 0.5028483867645264),
 ('liberated', 0.5003407597541809),
 ('zombies', 0.4987490773200989),
 ('99problems', 0.4951059818267822)]

In [24]:
synonymC

[('tapered', 0.529935359954834),
 ('devi', 0.5161897540092468),
 ('chinastocks', 0.5062400102615356),
 ('teachmehowtoadult', 0.4992290139198303),
 ('influencing', 0.4850664734840393),
 ('dako', 0.47909513115882874),
 ('systemaicmedicalprogram', 0.454127699136734),
 ('derulo', 0.45412763953208923),
 ('establishes', 0.45327693223953247),
 ('integrati', 0.4498327672481537)]

In [25]:
synonymD

[('eatin', 0.6147010326385498),
 ('hiphopheads', 0.5570492148399353),
 ('providian', 0.5414141416549683),
 ('ahahaha', 0.5339089632034302),
 ('a1zp48', 0.5333604216575623),
 ('barlcays', 0.5299626588821411),
 ('1lbf07r', 0.5296684503555298),
 ('lmss', 0.527097225189209),
 ('edh', 0.5142173767089844),
 ('ubs32', 0.49940818548202515)]
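
The scores in the lists above are cosine similarities between word vectors, which is how gensim's `most_similar` ranks neighbours. The sketch below shows the computation on made-up 3-dimensional vectors (the notebook's models use 40 dimensions); the vectors and the function name are illustrative only.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# toy vectors, made up for illustration
bank = [1.0, 2.0, 0.0]
near = [2.0, 4.0, 0.0]   # same direction as `bank`
far = [0.0, 0.0, 3.0]    # orthogonal to `bank`

print(cosine_similarity(bank, near))
print(cosine_similarity(bank, far))
```

Because the measure depends only on direction, a token seen once can land close to a bank name by chance, which may explain why so many rare tokens appear among the neighbours when `min_count=1`.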

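For the record, Word2Vec does not surface much topic-relevant information here: the nearest neighbours of each bank name are mostly rare tokens, misspellings, and link fragments. It may still be useful as a supplement to the LDA topics.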