# Analysis of Relationships and Topics among Tweets and Facebook Posts Associated with Specific Banks

## Questions and Deliverables

Q1. What financial topics* do consumers discuss on social media and what caused the consumers to post about this topic? 
   
>   Deliverable A - Describe your Approach and Methodology. Include a visual representation of your analytic process flow. 
   
>   Deliverable B - Discuss the data and its relationship to social conversation drivers. 

>   Deliverable C - Document your code and reference the analytic process flow-diagram from deliverable A. 


Q2. Are the topics and “substance” consistent across the industry or are they isolated to individual banks? 

>   Deliverable D - Create a list of topics and substance you found 

>   Deliverable E - Create a narrative of insights supported by the quantitative results (should include graphs or charts) 

#### Metadata:
Record Count: 220377

Media Type: Facebook & Twitter

Timeframe: Twitter data (8/2015) & Facebook data (8/2014 - 8/2015)

Scope: Social Media data with query of 4 banks

#### Scrubbed Data:
4 Banks: BankA, BankB, BankC, BankD

ADDRESS: All scrubbed addresses are replaced by uppercase ADDRESS. Any occurrence of a lowercase "address" is part of the text and is not a scrubbed replacement.

Name: All names have been replaced with the word "Name".
Internet links

INTERNET: All scrubbed INTERNET references are replaced by uppercase INTERNET. Any occurrence of a lowercase "internet" is part of the text and is not a scrubbed replacement.

twit_hndl: All Twitter handles ("@...") have been replaced with the lowercase abbreviation "twit_hndl". Bank Twitter handles have been replaced with the abbreviation followed by the respective bank name, e.g. "twit_hndl_BankA", "twit_hndl_BankB".

PHONE: All scrubbed phone numbers are replaced by uppercase PHONE. Any occurrence of a lowercase "phone" is part of the text and is not a scrubbed replacement.
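
Because the scrubbed placeholders are case-sensitive, a case-sensitive pattern is enough to tell them apart from ordinary words such as "phone" or "address". A minimal sketch, assuming the placeholder spellings listed above; the sample sentence and the helper name `count_placeholders` are made up for illustration:

```python
import re
from collections import Counter

# Placeholder spellings taken from the list above; matching is case-sensitive.
# The \b anchors keep "twit_hndl" from matching inside "twit_hndl_BankA".
PLACEHOLDER = re.compile(
    r"\b(twit_hndl_Bank[A-D]|twit_hndl|ADDRESS|INTERNET|PHONE|Name)\b"
)

def count_placeholders(text):
    """Count scrubbed placeholders in one post (hypothetical helper)."""
    return Counter(PLACEHOLDER.findall(text))

# Made-up post: lowercase "phone" is ordinary text, uppercase PHONE is scrubbed.
sample = "Name gave my phone number to twit_hndl_BankA, call PHONE or see INTERNET"
counts = count_placeholders(sample)
print(counts)
```

A count like this can be run over `FullText` before the placeholders are stripped, to see how much of each post was scrubbed.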

In [2]:
import numpy as np
import pandas as pd

ValueError: numpy.dtype has the wrong size, try recompiling

In [2]:
df = pd.read_csv('2015+Wells+Fargo+Campus+Analytic+Challenge+Dataset.txt', sep='|')
df.head(5)

Unnamed: 0,AutoID,Date,Year,Month,MediaType,FullText
0,1,8/26/2015,2015,8,twitter,3 ways the internet of things will change Bank...
1,2,8/5/2015,2015,8,twitter,BankB BankB Name downgrades apple stock to neu...
2,3,8/12/2015,2015,8,twitter,BankB returns to profit on INTERNET/! board2? ...
3,4,8/5/2015,2015,8,twitter,BankB tells advisers to exit paulson hedge fun...
4,5,8/12/2015,2015,8,twitter,BankC may plead guilty over foreign exchange p...


In [None]:
import re

# preprocess the scrubbed strings
p0 = re.compile('ADDRESS|Name|INTERNET|twit_hndl_?|PHONE')
pretext = df['FullText'].map(lambda x: p0.sub(' ', x))

## Find the most popular tags

In [3]:
from collections import Counter

text0 = pretext.map(lambda t: t.lower().split())
# keep tokens of the form word# (a run of word characters followed by '#')
rawtags1 = text0.map(lambda t: [re.search(r'\w+#\B', w) for w in t])
rawtags1 = rawtags1.map(lambda t: [w.group(0) for w in t if w is not None])
# flatten the per-post lists and count every tag in a single pass
rawtags2 = [tag for tags in rawtags1 for tag in tags]
# list of (tag, count) pairs, most frequent first
hashtags = Counter(rawtags2).most_common()
hashtags[:20]

[('name#', 8010),
 ('getcollegeready#', 4365),
 ('bankb#', 2769),
 ('bankd#', 2526),
 ('bankc#', 2449),
 ('finance#', 2290),
 ('internet#', 2084),
 ('money#', 2010),
 ('goldmansachs#', 1967),
 ('wallstreet#', 1946),
 ('banksters#', 1924),
 ('usbank#', 1915),
 ('hsbc#', 1914),
 ('economics#', 1914),
 ('federalreserve#', 1910),
 ('financialterrorists#', 1910),
 ('classwarfare#', 1906),
 ('morganstanley#', 1905),
 ('twit_hndl#', 1442),
 ('twit_hndl_banka#', 1298)]
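
The pattern `\w+#\B` keeps a run of word characters followed by `#`, provided the `#` is not directly followed by another word character. That is why every tag in the list above ends in `#`. A small sketch on made-up tokens shows what it does and does not match:

```python
import re

TAG = re.compile(r"\w+#\B")  # same pattern as in the cell above

# Made-up tokens for illustration only.
tokens = ["getcollegeready#", "#money", "bankb#,", "a#b", "plain"]
matches = [m.group(0) if m else None for m in map(TAG.search, tokens)]
print(matches)
```

Note that a leading-hash token such as `#money` is not matched by this pattern.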

## Prepare the Data Grouped By Banks

In [4]:
text = pretext.map(lambda x: x.lower())
text = text.map(lambda x: re.split(r'\W+|_+', x))
text.head(5)

0    [3, ways, the, internet, of, things, will, cha...
1    [bankb, bankb, name, downgrades, apple, stock,...
2    [bankb, returns, to, profit, on, internet, boa...
3    [bankb, tells, advisers, to, exit, paulson, he...
4    [bankc, may, plead, guilty, over, foreign, exc...
Name: FullText, dtype: object

In [5]:
indA = ['banka' in t for t in text]
indB = ['bankb' in t for t in text]
indC = ['bankc' in t for t in text]
indD = ['bankd' in t for t in text]
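
The four indicator lists act as boolean masks over the posts. A post can mention more than one bank, so the groups may overlap. A small sketch with made-up token lists (the names `posts`, `masks`, `per_bank` and `multi` are illustrative, not part of the notebook):

```python
# Toy token lists standing in for rows of `text` (made up for illustration).
posts = [
    ["bankb", "downgrades", "apple", "stock"],
    ["banka", "and", "bankb", "raise", "fees"],
    ["weather", "is", "nice"],
]
banks = ["banka", "bankb", "bankc", "bankd"]

# One boolean mask per bank, built the same way as indA..indD.
masks = {b: [b in tokens for tokens in posts] for b in banks}

# Number of posts per bank, and number of posts mentioning more than one bank.
per_bank = {b: sum(m) for b, m in masks.items()}
multi = sum(1 for tokens in posts if sum(b in tokens for b in banks) > 1)
print(per_bank)
print(multi)
```

Posts counted in `multi` end up in several per-bank models, which is worth keeping in mind when comparing topics across banks.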

In [6]:
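# pre-compile the patterns to speed up the substitutions
# (raw strings, so \b is a word boundary and \1 a backreference, not control characters)
p1 = re.compile(r"(RT|via)((?:\b\W*@\w+)+)")  # retweet / via attributions
p2 = re.compile(r"@\w+")                       # @mentions
p3 = re.compile(r"[0-9]+")                     # digits
p4 = re.compile(r"http\w+")                    # link remnants
p5 = re.compile(r"^\s+|\s+$")                  # leading / trailing whitespace
p6 = re.compile(r"^\w*(\w)(\1){2,}\w*$")       # tokens with a character repeated 3+ times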
text = text.map(lambda x: [p1.sub("", t) for t in x])
text = text.map(lambda x: [p2.sub("", t) for t in x])
text = text.map(lambda x: [p3.sub("", t) for t in x])
text = text.map(lambda x: [p4.sub("", t) for t in x])
text = text.map(lambda x: [p5.sub("", t) for t in x])
text = text.map(lambda x: [p6.sub("", t) for t in x])
text.head(2)

0    [, ways, the, internet, of, things, will, chan...
1    [bankb, bankb, name, downgrades, apple, stock,...
Name: FullText, dtype: object
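
Regex patterns that use `\b` or `\1` need raw strings in Python, because in an ordinary string literal those escapes are control characters. The sketch below checks this and shows a repeated-character filter on made-up tokens, assuming the intent of `p6` is to drop tokens in which one character occurs three or more times in a row:

```python
import re

# In a normal string literal these escapes are control characters.
backspace_is_control = ("\b" == "\x08")
one_is_control = ("\1" == "\x01")

# Raw string, so \1 refers back to the captured character.
REPEATED = re.compile(r"^\w*(\w)\1{2,}\w*$")

# Made-up tokens for illustration only.
tokens = ["loool", "sooo", "good", "bankkk", "fee"]
cleaned = [REPEATED.sub("", t) for t in tokens]
print(cleaned)
```

Tokens reduced to the empty string are later dropped by the stop-word filter, which skips `''`.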

In [7]:
stopwords1 = pd.read_csv('stopwords.txt')
stopwords2 = ['banka','bankb','bankc','bankd','bank','hndl','twit','lol','hey','make','name','don','bit','uhijre','ret','bankac','resp','ers','er','today','ift','dlvr','plc','goo','man','banke','bankds']

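# build the lookup set once; membership tests on a set are much faster than on an array
stopset = set(stopwords1.values.ravel()) | set(stopwords2)

def rmStopWord(wlist):
    # drop empty tokens and anything in either stop-word list
    return [w for w in wlist if w != '' and w not in stopset]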

text = text.map(lambda x: rmStopWord(x))
text.head(2)

0    [ways, internet, things, change, tt, uad, inte...
1    [downgrades, apple, stock, neutral, anticipate...
Name: FullText, dtype: object

## Learn From TF-IDF Matrix

In [8]:
lines = text.map(lambda x: ' '.join(x))

In [9]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidfvec = TfidfVectorizer(min_df=1)
Xtfidf = tfidfvec.fit_transform(lines)
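
The vectorizer call hides the weighting itself. As a rough illustration, the sketch below computes TF-IDF by hand on three made-up documents, using a smoothed idf of ln((1 + n) / (1 + df)) + 1 followed by L2 normalization of each row, which as far as I know matches the default scheme of `TfidfVectorizer`. Tokenization here is a plain whitespace split, and the function name `tfidf` and the sample documents are assumptions for illustration only.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {word: weight} dict per document (hypothetical helper)."""
    n = len(docs)
    tokenized = [d.split() for d in docs]
    # document frequency: in how many documents each word occurs
    df = Counter(w for toks in tokenized for w in set(toks))
    rows = []
    for toks in tokenized:
        tf = Counter(toks)
        # raw term count times smoothed inverse document frequency
        weights = {
            w: c * (math.log((1.0 + n) / (1.0 + df[w])) + 1.0)
            for w, c in tf.items()
        }
        # scale the row to unit Euclidean length
        norm = math.sqrt(sum(v * v for v in weights.values())) or 1.0
        rows.append({w: v / norm for w, v in weights.items()})
    return rows

# Made-up documents for illustration only.
docs = ["credit card fee", "credit card fraud", "mortgage loan rate"]
rows = tfidf(docs)
top = max(rows[0], key=rows[0].get)
print(top)
```

Words that occur in fewer documents receive a larger weight, so in the first toy document the rarer word outranks the two that also appear in the second document.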

## Generate LDA Topic Models

In [13]:
from sklearn.feature_extraction.text import CountVectorizer 
import gensim

def generate_ldamodel(docs, min_df=1, num_topics=10):
    # bag-of-words counts: one row per post, one column per term
    countvec = CountVectorizer(min_df=min_df)
    X = countvec.fit_transform(docs)
    # gensim expects an id -> word mapping; vocabulary_ maps word -> id
    id2words = dict((v, k) for k, v in countvec.vocabulary_.items())
    corpus_gensim = gensim.matutils.Sparse2Corpus(X, documents_columns=False)
    ldamodel = gensim.models.ldamodel.LdaModel(corpus_gensim, id2word=id2words, num_topics=num_topics, update_every=1, chunksize=1000, passes=1)
    return ldamodel

modelA = generate_ldamodel(lines[indA])
modelB = generate_ldamodel(lines[indB])
modelC = generate_ldamodel(lines[indC])
modelD = generate_ldamodel(lines[indD])


In [14]:
modelA.print_topics()

[u'0.045*card + 0.036*photo + 0.030*shared + 0.025*credit + 0.019*business + 0.018*economics + 0.015*debit + 0.013*night + 0.011*working + 0.011*money',
 u'0.038*money + 0.023*finance + 0.017*hsbc + 0.016*cash + 0.012*made + 0.012*days + 0.011*fb + 0.011*buy + 0.011*information + 0.010*card',
 u'0.045*contest + 0.044*getcollegeready + 0.018*love + 0.015*school + 0.014*awesome + 0.011*long + 0.011*set + 0.011*win + 0.011*story + 0.010*ready',
 u'0.024*goldmansachs + 0.024*banksters + 0.024*morganstanley + 0.023*wallstreet + 0.023*financialterrorists + 0.023*classwarfare + 0.021*wow + 0.020*mortgage + 0.017*home + 0.016*fund',
 u'0.068*center + 0.056*internet + 0.014*tickets + 0.014*financial + 0.013*pm + 0.012*tomorrow + 0.011*email + 0.011*college + 0.009*coming + 0.009*philadelphia',
 u'0.055*internet + 0.050*ly + 0.012*home + 0.012*loan + 0.011*market + 0.011*loans + 0.010*fuck + 0.010*community + 0.009*program + 0.009*world',
 u'0.077*internet + 0.018*guys + 0.016*worst + 0.013*supp

In [15]:
modelB.print_topics()

[u'0.050*photo + 0.033*check + 0.015*people + 0.014*line + 0.013*business + 0.012*branch + 0.012*teller + 0.011*hate + 0.011*hold + 0.011*pm',
 u'0.043*internet + 0.041*finance + 0.041*stadium + 0.036*hsbc + 0.035*financialterrorists + 0.016*charlotte + 0.014*team + 0.013*panthers + 0.012*game + 0.011*watch',
 u'0.045*internet + 0.037*morganstanley + 0.025*home + 0.018*mortgage + 0.018*buy + 0.014*pay + 0.014*loan + 0.012*free + 0.011*st + 0.011*times',
 u'0.026*chicago + 0.022*marathon + 0.016*goldman + 0.013*sachs + 0.012*yall + 0.011*day + 0.011*internet + 0.011*video + 0.010*tickets + 0.009*shit',
 u'0.020*yo + 0.018*fuck + 0.016*rating + 0.014*internet + 0.010*work + 0.009*post + 0.009*rebanke + 0.009*america + 0.009*shit + 0.009*overdraft',
 u'0.041*card + 0.026*service + 0.025*customer + 0.018*credit + 0.018*time + 0.016*years + 0.016*debit + 0.015*account + 0.014*banking + 0.013*worst',
 u'0.083*money + 0.044*shared + 0.034*goldmansachs + 0.034*wallstreet + 0.034*usbank + 0.034

In [16]:
modelC.print_topics()

[u'0.036*buy + 0.029*internet + 0.016*stock + 0.013*work + 0.011*people + 0.010*big + 0.008*money + 0.008*real + 0.008*good + 0.008*interest',
 u'0.100*internet + 0.015*ly + 0.015*banking + 0.012*wow + 0.010*city + 0.008*global + 0.007*report + 0.007*ceo + 0.007*tower + 0.007*plaza',
 u'0.054*card + 0.051*credit + 0.027*account + 0.013*cards + 0.011*banks + 0.010*rebanke + 0.008*cash + 0.007*pay + 0.007*time + 0.007*put',
 u'0.078*goldmansachs + 0.022*service + 0.020*customer + 0.013*manager + 0.012*free + 0.011*account + 0.010*customers + 0.007*call + 0.007*mortgage + 0.007*days',
 u'0.162*shared + 0.082*wallstreet + 0.081*economics + 0.013*open + 0.009*ny + 0.007*international + 0.007*house + 0.006*data + 0.006*night + 0.005*start',
 u'0.045*sachs + 0.030*goldman + 0.030*internet + 0.018*ly + 0.018*group + 0.017*trader + 0.012*tom + 0.012*gl + 0.012*case + 0.010*buy',
 u'0.076*internet + 0.056*hsbc + 0.053*federalreserve + 0.053*rating + 0.035*tt + 0.021*neutral + 0.016*price + 0.016

In [17]:
modelD.print_topics()

[u'0.064*internet + 0.028*ly + 0.015*transfer + 0.015*young + 0.015*news + 0.014*data + 0.013*target + 0.012*million + 0.012*investment + 0.011*price',
 u'0.061*account + 0.033*card + 0.020*neutral + 0.020*credit + 0.012*deposit + 0.011*fb + 0.010*open + 0.010*manager + 0.009*debit + 0.008*life',
 u'0.019*dont + 0.015*people + 0.014*pay + 0.013*great + 0.012*check + 0.009*years + 0.008*call + 0.008*give + 0.008*whats + 0.007*banks',
 u'0.074*internet + 0.049*financial + 0.049*asset + 0.043*stockport + 0.038*managers + 0.037*advisers + 0.035*photo + 0.028*overweight + 0.009*trading + 0.008*report',
 u'0.069*money + 0.032*shared + 0.032*finance + 0.027*hsbc + 0.027*banksters + 0.026*wallstreet + 0.026*economics + 0.026*morganstanley + 0.026*classwarfare + 0.026*financialterrorists',
 u'0.064*rating + 0.058*internet + 0.040*tt + 0.018*reiterated + 0.009*yum + 0.007*house + 0.007*trade + 0.007*robbery + 0.007*billion + 0.006*downgraded',
 u'0.043*internet + 0.015*stock + 0.013*community + 
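
One hedged way to approach Q2 is to compare the top words of topics from different bank models. The sketch below parses three of the topic strings printed above and computes their Jaccard overlap. The helper names `topic_words` and `jaccard` are made up, and the parser assumes the `weight*word + ...` format shown in this output.

```python
def topic_words(topic):
    """Parse a '0.045*card + 0.036*photo' string into a set of words."""
    return {term.split("*", 1)[1].strip() for term in topic.split(" + ")}

def jaccard(a, b):
    """Jaccard similarity of two sets: size of intersection over size of union."""
    return len(a & b) / float(len(a | b))

# Topic strings copied from the printed output above.
topic_a = ("0.024*goldmansachs + 0.024*banksters + 0.024*morganstanley + "
           "0.023*wallstreet + 0.023*financialterrorists + 0.023*classwarfare + "
           "0.021*wow + 0.020*mortgage + 0.017*home + 0.016*fund")
topic_d = ("0.069*money + 0.032*shared + 0.032*finance + 0.027*hsbc + "
           "0.027*banksters + 0.026*wallstreet + 0.026*economics + "
           "0.026*morganstanley + 0.026*classwarfare + 0.026*financialterrorists")
topic_b = ("0.041*card + 0.026*service + 0.025*customer + 0.018*credit + "
           "0.018*time + 0.016*years + 0.016*debit + 0.015*account + "
           "0.014*banking + 0.013*worst")

sim_ad = jaccard(topic_words(topic_a), topic_words(topic_d))
sim_ab = jaccard(topic_words(topic_a), topic_words(topic_b))
print(sim_ad)
print(sim_ab)
```

Here the hashtag-heavy BankA topic shares five of its ten top words with a BankD topic and none with BankB's card and customer-service topic. A fuller comparison would loop over all topic pairs of all four models.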

## Generate Similar Words from Word2Vec