# Web Crawling & Brand Associations

In the following, we will develop a web crawler to fetch posts from Edmunds.com's discussion forums. We will then use that data to understand associations between car brands and associations between car brands and their attributes.

Below is a web crawler that collects 5,000+ posts from the [Entry Level Luxury Performance Sedans](http://forums.edmunds.com/discussion/2864/general/x/entry-level-luxury-performance-sedans) forum on Edmunds.com.

In [2]:
import requests
import pandas as pd
from bs4 import BeautifulSoup
from pandas import DataFrame
import sklearn
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np
import re

In [7]:
# Initialize dataframe and lists to store fetched posts
edmunds = DataFrame()
username = []
date = []
message = []

In [8]:
# Crawl the forum and extract username, date, and post
page = 1
while page < 200:
    r = requests.get('http://forums.edmunds.com/discussion/2864/general/x/entry-level-luxury-performance-sedans/p' + str(page))
    soup = BeautifulSoup(r.text)
    username.extend([x.get_text() for x in soup.find_all("a", attrs={"class": "Username"})])
    date.extend([x['datetime'][:10] for x in soup.find_all("time")])
    message.extend([x.get_text().strip() for x in soup.find_all("div", attrs={"class": "Message"})])
    page += 1

In [9]:
# Store data into dataframe
edmunds['username'] = username
edmunds['date'] = date
edmunds['message'] = message

## Associations Between Car Brands

To allow for a higher-level car brand analysis, we will clean the data by searching for model names and replacing them with brands.

In [10]:
# Read in lookup table of brands and models
brands = pd.read_csv("brand_lookup.csv")

In [11]:
# Create a dictionary to use for search and replace
brands_df = brands.ix[:,'Replace']
brands_df.index = brands.ix[:,'Search']
brands_dict = brands_df.to_dict()

In [12]:
# Replace NaN posts with an empty string
edmunds['message'] = edmunds['message'].replace(np.nan,' ', regex=True)

In [13]:
# Lowercase all text
edmunds['message'] = edmunds['message'].str.lower()

In [14]:
# Find models and replace with brands
pattern = re.compile(r'\b(' + '|'.join(brands_dict.keys()) + r')\b')
for index, message in enumerate(edmunds['message']):
    edmunds['message'][index] = pattern.sub(lambda x: brands_dict[x.group()], message)

In [15]:
# Create binary document term matrix
countvec = CountVectorizer(binary=True, stop_words='english')
DTM = pd.DataFrame(countvec.fit_transform(edmunds['message']).toarray(), columns=countvec.get_feature_names())

In [16]:
# Sum the DTM to get word counts
word_counts = DTM.sum()

In [17]:
# Sort word counts descending
word_counts.sort(ascending=0)

In [18]:
# View word counts to identify top brands
word_counts.head(20)

car            3029
bmw            1754
like           1524
just           1397
acura          1350
infiniti       1219
don            1172
think          1125
drive           893
sedan           882
better          869
performance     823
new             796
know            758
people          752
good            748
really          735
best            718
driving         701
want            661
dtype: int64

In [19]:
# Initialize list of top 10 brands and create top brands DTM
top10_brands = ['bmw', 'acura', 'infiniti', 'lexus', 'audi', 'cadillac', 'honda', 'nissan', 'toyota', 'mercedes']
top10_DTM = DTM.ix[:, top10_brands]

In [20]:
# Create dictionary of brand counts
brand_count = {}
for brand in top10_brands:
    brand_count[brand] = word_counts.ix[brand]

We will use lift ratios as a measure of brand association. Lift ratios tell us whether words appear together by chance or due to association. The formula for calculating lift is P(A & B) = P(A & B) / (P(A) * P(B)), where P() indicates the probability of. A lift ratio of < 1 indicates that the words are less likely to appear together than by chance, while a lift ratio of > 1 indicates association between two words. The higher the number, the greater the association.

In [21]:
# Initialize lists to hold combination of brands and their lift scores
combo_list = []
lift_list = []

In [22]:
# Loop through top 10 car brands and calculate lift scores for each brand combination
for i in xrange(0, 10):
    for j in xrange(1, 10):
        if j > i:
            combo_list.append(top10_brands[i] + " & " + top10_brands[j])
            combo_count = sum(top10_DTM[top10_brands[i]] + top10_DTM[top10_brands[j]] == 2)
            n = len(top10_DTM)
            lift = float(combo_count * n) / (brand_count[top10_brands[i]] * brand_count[top10_brands[j]])
            lift_list.append(lift)

In [23]:
# Store into dataframe
lift_df = DataFrame()
lift_df['brands'] = combo_list
lift_df['brand1'] = [x[0:re.search('&', x).start()-1] for x in lift_df['brands']]
lift_df['brand2'] = [x[re.search('&', x).start()+2:] for x in lift_df['brands']]
lift_df['lift'] = lift_list

In [24]:
# View lift scores
lift_df

Unnamed: 0,brands,brand1,brand2,lift
0,bmw & acura,bmw,acura,1.240649
1,bmw & infiniti,bmw,infiniti,1.552704
2,bmw & lexus,bmw,lexus,2.143397
3,bmw & audi,bmw,audi,1.78582
4,bmw & cadillac,bmw,cadillac,1.451411
5,bmw & honda,bmw,honda,1.158504
6,bmw & nissan,bmw,nissan,1.15783
7,bmw & toyota,bmw,toyota,1.323863
8,bmw & mercedes,bmw,mercedes,1.805268
9,acura & infiniti,acura,infiniti,1.966568


In [25]:
# Save to csv
lift_df.to_csv(r'lift.csv', index=False)

To visualize the data, we can make a network diagram using a tool like [Gephi](https://gephi.org/) or a multi-dimensional scaling map using [XLSTAT](https://www.xlstat.com/en/download) with Excel.

* The [first network graph](https://github.com/juliaawu/mis184n-social-media-analytics/blob/master/web-crawling-and-brand-associations/network_graph.PNG) doesn't tell us much. All brands are talked about in association with each other with lift ratios > 1. However, [filtering the graph by lifts > 3](https://github.com/juliaawu/mis184n-social-media-analytics/blob/master/web-crawling-and-brand-associations/network_graph_filtered.PNG) reveals some interesting brand associations. Nissan, Honda, and Toyota are all linked, which indicates that consumers may find these brands similar. This makes sense because the brands are all economical. We also see Mercedes linked to Audi, Cadillac, and Lexus, which suggest that consumers may view Mercedes to be a point of comparison for luxury vehicles. Acura is linked to Honda, Cadillac, and Lexus in a similar way, which implies that it could be a focal point for mid-range cars. BMW and Infiniti have no connections with lift > 3. This might be because they are more unqiue in nature. Acura, Cadillac, and Audi also seem to be bridges in terms of association between a lower-end to higher-end vehicles.


* The [MDS graph](https://github.com/juliaawu/mis184n-social-media-analytics/blob/master/web-crawling-and-brand-associations/mds.PNG) shows Toyota, Honda, and Nissan clustered together in the bottom left, indicating their similarity. This is reflective of what we saw in the network graph. Mercedes, Audi, Cadillac, and Lexus are also close together with Mercedes in the center as the focal point. Acura and Infiniti are in the lower right quadrant and BMW is at the top of the graph by itself. The axis of MDS graphs are open to interpretation. Looking at the placement of these brands, I would infer the x-axis to be an indication of price and the y-axis to be an indication of performance.

This type of analysis would be useful for car companies in evaluating consumer perception and identifying direct competitors.

##Associations Between Car Brands and Attributes
Now we will conduct a similar analysis between the top 5 car brands and car attributes to identify which attributes are most commonly talked about for each brand.

In [103]:
# Read in lookup table of attributes and their synonyms
attributes = pd.read_csv("attribute_lookup.csv")

In [104]:
# Create a dictionary to use for search and replace
attributes_df = attributes.ix[:,'Replace']
attributes_df.index = attributes.ix[:,'Search']
attributes_dict = attributes_df.to_dict()

In [105]:
# Find attribute synonyms and replace with attribute
pattern = re.compile(r'\b(' + '|'.join(attributes_dict.keys()) + r')\b')
for index, message in enumerate(edmunds['message']):
    edmunds['message'][index] = pattern.sub(lambda x: attributes_dict[x.group()], message)

In [106]:
# Create binary document term matrix
DTM2 = pd.DataFrame(countvec.fit_transform(edmunds['message']).toarray(), columns=countvec.get_feature_names())

In [107]:
# Sum the DTM to get word counts
word_counts2 = DTM2.sum()

In [108]:
# Sort word counts descending
word_counts2.sort(ascending=0)

In [130]:
# View word counts to identify top attributes
word_counts2.head(25)

car             3029
performance     2434
bmw             1754
economy         1629
like            1524
just            1397
acura           1350
infiniti        1219
don             1172
engine          1140
think           1125
drive            893
sedan            882
better           869
design           800
new              796
know             758
people           752
good             748
really           735
best             718
does             633
say              620
way              616
aspirational     611
dtype: int64

The top 5 most talked about attributes are performance, economy, engine, design, and aspirational.

In [115]:
# Initialize list of top attributes and intitalize variable of top 5 brands
atts = [x for x in attributes['Replace'].unique() if x in DTM2.columns]
top5_brands = top10_brands[0:5]

In [116]:
# Create top 5 brands and attributes DTMs
top5_brands_DTM = DTM.ix[:,top10_brands[0:5]]
atts_DTM = DTM.ix[:,atts]

In [117]:
# Create dictionary of attribute counts
att_count = {}
for att in atts:
    att_count[att] = word_counts2.ix[att]

In [124]:
# Initialize lists to hold combination of brands and attributes and their lift scores
combo_list2 = []
lift_list2 = []

In [125]:
# Loop through top 5 car brands and attributes and calculate lift scores for each brand + attribute combination
for i in xrange(0, 5):
    for j in xrange(0, len(atts)):
        combo_list2.append(top5_brands[i] + " & " + atts[j])
        combo_count = sum(top5_brands_DTM[top5_brands[i]] + atts_DTM[atts[j]] == 2)
        n = len(top5_brands_DTM)
        lift = float(combo_count * n) / (brand_count[top5_brands[i]] * att_count[atts[j]])
        lift_list2.append(lift)

In [126]:
# Store into dataframe
lift_df2 = DataFrame()
lift_df2['combo'] = combo_list2
lift_df2['brand'] = [x[0:re.search('&', x).start()-1] for x in lift_df2['combo']]
lift_df2['attribute'] = [x[re.search('&', x).start()+2:] for x in lift_df2['combo']]
lift_df2['lift'] = lift_list2

In [127]:
# View lift scores
lift_df2.sort(columns=['brand','lift'], ascending=False)

Unnamed: 0,combo,brand,attribute,lift
90,lexus & hybrid,lexus,hybrid,3.008818
98,lexus & service,lexus,service,2.606650
95,lexus & reliability,lexus,reliability,1.878116
91,lexus & interior,lexus,interior,1.685696
94,lexus & price,lexus,price,1.531762
80,lexus & brand,lexus,brand,1.360237
101,lexus & warranty,lexus,warranty,1.358821
81,lexus & dealer,lexus,dealer,1.253674
100,lexus & transmission,lexus,transmission,1.141042
86,lexus & experience,lexus,experience,1.055940


In [135]:
# Filter by brand
lift_df2[lift_df2['brand'] == 'audi'].sort(columns=['lift'], ascending=False)

Unnamed: 0,combo,brand,attribute,lift
124,audi & service,audi,service,2.261375
120,audi & price,audi,price,1.957705
117,audi & interior,audi,interior,1.81921
118,audi & noise,audi,noise,1.673252
121,audi & reliability,audi,reliability,1.662594
127,audi & warranty,audi,warranty,1.637268
106,audi & brand,audi,brand,1.359517
112,audi & experience,audi,experience,1.296884
107,audi & dealer,audi,dealer,0.887848
122,audi & safety,audi,safety,0.870091


The brand + attribute lift dataframe is sorted by brand and lift so that we can see each brand with their top attributes listed first. For example, we see that Lexus is most talked about for its hybrid, service and reliability, Acura is mentioned for its price and interior, and Audi is known for its service.

In [164]:
lift_df2[lift_df2['attribute'] == "aspirational"].sort(columns="lift", ascending=False)

Unnamed: 0,combo,brand,attribute,lift
13,infiniti & aspirational,infiniti,aspirational,1.475095
8,acura & aspirational,acura,aspirational,1.418823
18,lexus & aspirational,lexus,aspirational,1.396072
23,audi & aspirational,audi,aspirational,1.388443
3,bmw & aspirational,bmw,aspirational,1.376174


The data also indicate that Infiniti is the most aspirational brand of the 5.