## free-marketing-watch
Search social media for mentions of brands and collect the matching comments, tweets, and other posts.
Count the mentions of each brand and run sentiment analysis on the collected text.

In [1]:
import requests, json, time
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from pathlib import Path
from brands import fashionlist

Next, fetch the comment data, load it into a DataFrame, and clean it down to the fields we need.

In [2]:
def get_comment_info(dataset):
    """Takes the data section of a Pushshift JSON response and returns a
    list of lists with comment attributes: [body, score, id, subreddit]."""
    comment_info = []
    for comment in dataset:
        body = comment['body']
        score = comment['score']
        comment_id = comment['id']  # avoid shadowing the built-in id()
        subreddit = comment['subreddit']
        comment_info.append([body, score, comment_id, subreddit])
    return comment_info
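
To check the shape of the output without calling the API, the same extraction can be run on a hand-made record. This is a minimal sketch: the sample comment is invented for illustration (it is not real Pushshift output), and `extract_fields` is a hypothetical compact stand-in for `get_comment_info`.

```python
FIELDS = ["body", "score", "id", "subreddit"]

def extract_fields(dataset, fields=FIELDS):
    """Return one [body, score, id, subreddit] list per comment dict."""
    return [[comment[field] for field in fields] for comment in dataset]

# Invented sample record; extra keys such as created_utc are ignored.
sample = [
    {
        "body": "Love these chinos",
        "score": 3,
        "id": "abc123",
        "subreddit": "malefashionadvice",
        "created_utc": 1577836900,
    },
]

rows = extract_fields(sample)
print(rows)
```

Each inner list lines up with the four second-level column names used later when the DataFrame is built.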

In [3]:
def get_more_data(dataset, brand, subreddit_):
    """Keep requesting further pages of comments after hitting the API's
    100-results-per-request limit, until a request returns no data."""
    # Start with the comments from the page that was already fetched.
    comment_data = get_comment_info(dataset)
    while len(dataset) > 0:
        # Resume from the timestamp of the last comment received.
        after = dataset[-1]['created_utc']
        pushshift_url = f'https://api.pushshift.io/reddit/search/comment/?q={brand}&subreddit={subreddit_}&after={after}&before=1609459200&size=100&fields=body,score,id,subreddit,created_utc'
        r = requests.get(pushshift_url)
        try:
            data_json = json.loads(r.text)
        except json.JSONDecodeError:
            # The response was not JSON (e.g. when rate limited): wait, then retry once.
            time.sleep(60)
            r = requests.get(pushshift_url)
            data_json = json.loads(r.text)
        dataset = data_json['data']
        comment_data += get_comment_info(dataset)
    return comment_data
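
The paging pattern used here (take the timestamp of the last record, ask for everything after it, stop on an empty page) can be tried offline. The sketch below assumes a fake in-memory "API"; `collect_all`, `fake_fetch`, and `FAKE_COMMENTS` are hypothetical names made up for the illustration and do not touch Pushshift.

```python
def collect_all(first_page, fetch_page):
    """Collect records page by page until a page comes back empty."""
    records = list(first_page)
    page = first_page
    while page:
        # Resume after the timestamp of the last record received.
        after = page[-1]["created_utc"]
        page = fetch_page(after)
        records.extend(page)
    return records

# Fake data source: five comments with increasing timestamps.
FAKE_COMMENTS = [{"id": str(i), "created_utc": 100 + i} for i in range(5)]

def fake_fetch(after, size=2):
    """Return at most `size` comments strictly newer than `after`."""
    newer = [c for c in FAKE_COMMENTS if c["created_utc"] > after]
    return newer[:size]

all_comments = collect_all(fake_fetch(0), fake_fetch)
print([c["id"] for c in all_comments])
```

One thing the sketch makes visible: when the cursor is strictly "newer than", two comments sharing the same timestamp at a page boundary could be split so that the second one is never returned. Whether that applies to the real endpoint depends on how it treats `after`, which is not verified here.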

In [4]:
def pushshift_search(query, subreddit_, startingutc):
    """Sends a request to the Pushshift API endpoint and retrieves comment data
    from startingutc up to 1609459200 (January 1, 2021 at 12:00 AM UTC).

    Inputs
    ------
    query: What to search for in comments.
    subreddit_: Subreddit being searched under.
    startingutc: Starting date-time to search from, as a Unix timestamp.

    Returns
    -------
    list of lists: Lists contain comment attributes. ex: [body, score, id, subreddit]
    """
    pushshift_url = f'https://api.pushshift.io/reddit/search/comment/?q={query}&subreddit={subreddit_}&after={startingutc}&before=1609459200&size=100&fields=body,score,id,subreddit,created_utc'
    r = requests.get(pushshift_url)
    data_json = json.loads(r.text)
    data = data_json['data']
    comment_info = get_more_data(data,query,subreddit_)
    return comment_info
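
Because the URL is assembled with an f-string, a search term containing a space or `&` goes into the query string unescaped. One possible alternative, shown as a sketch only, is to build the query string with the standard library's `urlencode`. The helper name `build_search_url` and the search term "Jack & Jones" are assumptions for illustration; the endpoint and parameter names are the ones already used above.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.pushshift.io/reddit/search/comment/"

def build_search_url(query, subreddit, after, before=1609459200, size=100):
    """Build the search URL with every parameter value percent-encoded."""
    params = {
        "q": query,
        "subreddit": subreddit,
        "after": after,
        "before": before,
        "size": size,
        "fields": "body,score,id,subreddit,created_utc",
    }
    # safe="," keeps the commas in the fields list readable.
    return f"{BASE_URL}?{urlencode(params, safe=',')}"

url = build_search_url("Jack & Jones", "malefashionadvice", 1577836800)
print(url)
```

With this approach the `&` inside the search term is encoded as `%26`, so it can no longer be mistaken for a parameter separator.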

In [5]:
def create_comments_df(subreddit_, brandlist):
    """Returns a pandas df with the comments from 2020 that mention each brand.

    Inputs
    ------
    subreddit_: str, subreddit to be searched.
    brandlist: dict keyed by brand. A list value holds the versions of the
        name to search for; any other value means search for the brand itself.

    Returns
    -------
    Pandas MultiIndex dataframe (None if no brand has any mentions).
    """
    start_utc = '1577836800'  # January 1st, 2020 at 12:00 AM UTC
    comments_df = None
    for brand, v in brandlist.items():
        # Level 1 for the brand and level 2 for the comment info types
        columns = pd.MultiIndex.from_product([[brand], ["body", "score", "id", "subreddit"]], names=["brand", "datatype"])
        if isinstance(v, list):
            # Query every version of the brand name and pool the results
            comment_info = []
            for version in v:
                comment_info += pushshift_search(version, subreddit_, start_utc)
        else:
            # Query comments mentioning the brand in a specified subreddit
            comment_info = pushshift_search(brand, subreddit_, start_utc)
        if not comment_info:
            print(f'No mentions of {brand} found.')
            continue
        brand_df = pd.DataFrame(np.array(comment_info), columns=columns)
        # Outer join keeps all rows when brands have different comment counts
        comments_df = brand_df if comments_df is None else comments_df.join(brand_df, how='outer')
    return comments_df
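
The column layout is easier to see on a toy example. The sketch below builds two small per-brand frames with the same two-level columns and outer-joins them; the rows are invented and `brand_frame` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd

def brand_frame(brand, rows):
    """Build a frame whose columns are (brand, datatype) pairs."""
    columns = pd.MultiIndex.from_product(
        [[brand], ["body", "score", "id", "subreddit"]],
        names=["brand", "datatype"],
    )
    return pd.DataFrame(rows, columns=columns)

# Invented rows: two comments for one brand, one for the other.
gap = brand_frame("Gap", [
    ["nice jeans", 2, "a1", "malefashionadvice"],
    ["ok tees", 1, "a2", "malefashionadvice"],
])
uniqlo = brand_frame("Uniqlo", [
    ["great socks", 5, "b1", "malefashionadvice"],
])

# The outer join aligns on the row index and pads the shorter brand with NaN.
combined = gap.join(uniqlo, how="outer")
print(combined)
```

The rows of different brands are unrelated to each other; the shared index is only a position within each brand's own list, which is why the shorter brands end in blank cells.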

In [6]:
%%time
brandlist = fashionlist
subreddit = "malefashionadvice"
comments_df = create_comments_df(subreddit,brandlist)
comments_df

Wall time: 1min 13s


brand,Gap,Gap,Gap,Gap
datatype,body,score,id,subreddit
0,"So for tarters, $200 is not a high price for a...",1,fctgxyt,malefashionadvice
1,Bonobos are the nicest chinos I own and I’ve t...,1,fcv8ykj,malefashionadvice
2,"F&amp;T is ok, better than most items in the O...",1,fcvionb,malefashionadvice
3,You're describing multiple issues.\n\nBrands d...,21,fcvmtb3,malefashionadvice
4,What do you guys think about this (gap) denim ...,2,fcyy3cp,malefashionadvice
...,...,...,...,...
1456,Yeh seems like Banana Republic is the step up ...,1,ghjx5kx,malefashionadvice
1457,Gap quality is generally better than Jack &amp...,1,ghjxpws,malefashionadvice
1458,So what your looking for is really just plaid ...,1,ghm2mki,malefashionadvice
1459,I just own one pair of Spier pants so not sure...,2,ghmbr56,malefashionadvice


Run this cell to export the DataFrame to CSV. Be careful: the default call overwrites any existing file. To add to an existing file instead, use the commented-out line with mode = 'a'.


In [6]:
p = Path.cwd() / 'data' / 'pushshiftdf.csv'
comments_df.to_csv(path_or_buf = p)
#comments_df.to_csv(path_or_buf = p, mode = 'a', header=False)
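
One way to guard against an accidental overwrite is to check for the file before writing. This is a sketch under stated assumptions: `export_csv` is a hypothetical helper, and the demo writes a tiny invented frame to a temporary directory instead of the notebook's `data` folder.

```python
import tempfile
from pathlib import Path

import pandas as pd

def export_csv(df, path, overwrite=False):
    """Write df to path, refusing to replace an existing file by default."""
    path = Path(path)
    if path.exists() and not overwrite:
        raise FileExistsError(f"{path} already exists; pass overwrite=True to replace it")
    path.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(path)
    return path

# Demo in a throwaway directory.
demo_df = pd.DataFrame({"body": ["nice jeans"], "score": [2]})
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "data" / "pushshiftdf.csv"
    export_csv(demo_df, target)          # first write succeeds
    try:
        export_csv(demo_df, target)      # second write is refused
        refused = False
    except FileExistsError:
        refused = True
    print(refused)
```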

In [9]:
p = Path.cwd() / 'data' / 'pushshiftdf.csv'
df = pd.read_csv(p, index_col=0, header=[0,1])
df

brand,Uniqlo,Uniqlo,Uniqlo,Uniqlo,J.Crew,J.Crew,J.Crew,J.Crew,Costco,Costco,Costco,Costco
datatype,body,score,id,subreddit,body,score,id,subreddit,body,score,id,subreddit
0,Sorry for being a capitalist scum but uniqlo U...,1,fcoskn4,malefashionadvice,Happy new year!\n\nI was wondering if there ex...,1.0,fcp6bdl,malefashionadvice,(All prices CAD)\n\nJanuary - NB 574s ($80) ED...,4.0,fdkhsc7,malefashionadvice
1,Thanks. I think how it drapes is really import...,1,fcoupys,malefashionadvice,What is your budget? Wool coats generally aren...,1.0,fcrnqfn,malefashionadvice,Those Woolrich socks are borderline fraudulent...,1.0,fdrqvqr,malefashionadvice
2,I would buy my black vnecks from other places ...,1,fcouxki,malefashionadvice,I think that’s part of the reason why the cost...,1.0,fcrwp27,malefashionadvice,They are. Them and Kirkland signature from Cos...,1.0,fdyjwik,malefashionadvice
3,Not all of Uniqlo's cotton comes from Xinjiang...,1,fcoylho,malefashionadvice,Fashion Scavenger Hunt: Tracking Down Harry St...,1.0,fcsjt1t,malefashionadvice,"Honestly, Costco Kirkland is probably the best...",1.0,fecbcxx,malefashionadvice
4,I guess I've been boycotting Uniqlo because I'...,1,fcpabsq,malefashionadvice,Try the sidebar. There’s is a bunch of thread...,1.0,fcslz10,malefashionadvice,Hands Beefy and Kirkland have some great heavy...,1.0,ff6x1il,malefashionadvice
...,...,...,...,...,...,...,...,...,...,...,...,...
5948,Sweatpants from uniqlo\n\nT shirt from homage\...,1,ghk6uff,malefashionadvice,,,,,,,,
5949,"Camoshita jacket, Uniqlo merino wool longsleev...",1,ghkwr88,malefashionadvice,,,,,,,,
5950,So what your looking for is really just plaid ...,1,ghm2mki,malefashionadvice,,,,,,,,
5951,Was that you asking for advice? I just read yo...,1,ghmby1w,malefashionadvice,,,,,,,,
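
With the frame in this shape, counting mentions per brand amounts to counting the non-empty `body` cells in each brand's block. The sketch below assumes a small invented frame with the same two-level columns; it is one possible way to do the count, not necessarily the approach used later in the project.

```python
import numpy as np
import pandas as pd

columns = pd.MultiIndex.from_product(
    [["Uniqlo", "Costco"], ["body", "score", "id", "subreddit"]],
    names=["brand", "datatype"],
)
# Invented rows: three Uniqlo comments, one Costco comment, NaN padding.
toy_df = pd.DataFrame(
    [
        ["u1", 1, "a", "malefashionadvice", "c1", 4, "x", "malefashionadvice"],
        ["u2", 1, "b", "malefashionadvice", np.nan, np.nan, np.nan, np.nan],
        ["u3", 2, "c", "malefashionadvice", np.nan, np.nan, np.nan, np.nan],
    ],
    columns=columns,
)

# Select the "body" column of every brand, then count the non-NaN cells.
bodies = toy_df.xs("body", axis=1, level="datatype")
mention_counts = bodies.count().sort_values(ascending=False)
print(mention_counts)
```

A comment that matches more than one search term for the same brand appears once per match, so dropping duplicate `id` values before counting may be worth considering.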


If you prefer pickling over CSV, use these cells for I/O.

In [7]:
p = Path.cwd() / 'data' / 'pushshiftdf.pkl'
comments_df.to_pickle(path = p)

In [None]:
df = pd.read_pickle(filepath_or_buffer=p)