## Data set share: Apparel companies

This notebook shares Twitter follower data for two apparel brands with very different business models, Patagonia and Boohoo, with the goal of comparing the makeup of each organization's followers.

Patagonia (@patagonia) has been a global leader in sustainable apparel for over 45 years.

Boohoo (@boohoo) is a UK-based fast fashion company that has scored near the bottom of the apparel industry in several recent reports on environmental and social impact.

I was interested in these data as a way to analyze demographic and psychographic differences between the Twitter followers of each organization. My hypothesis is that the people who follow each organization will differ significantly on these dimensions, and that the two audiences will not overlap substantially.
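
The overlap part of this hypothesis can be checked directly from the `user_id` column of each follower file. The block below is a minimal sketch, not part of the original analysis: `audience_overlap` is a hypothetical helper name, and the IDs are toy values chosen only to make the arithmetic easy to follow.

```python
def audience_overlap(ids_a, ids_b):
    """Return shared-follower counts and Jaccard similarity for two ID collections."""
    set_a, set_b = set(ids_a), set(ids_b)
    shared = set_a & set_b
    union = set_a | set_b
    return {
        "shared": len(shared),
        # Fraction of each audience that also follows the other account
        "share_of_a": len(shared) / len(set_a) if set_a else 0.0,
        "share_of_b": len(shared) / len(set_b) if set_b else 0.0,
        # Jaccard similarity: shared followers divided by all distinct followers
        "jaccard": len(shared) / len(union) if union else 0.0,
    }


# Toy IDs for illustration only; these are not taken from the real follower files.
toy_a = ["1", "2", "3", "4"]
toy_b = ["3", "4", "5"]
overlap = audience_overlap(toy_a, toy_b)
print(overlap)
```

With the DataFrames loaded later in this notebook, the same helper could presumably be called as `audience_overlap(pata["user_id"], boohoo["user_id"])`, since both files carry a `user_id` column.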

Followers were pulled via the Twitter API on October 28, 2021.

In [1]:
import pandas as pd
import sqlite3
import random
import numpy as np
from nltk.corpus import stopwords
sw = stopwords.words('english')
from string import punctuation
from collections import Counter, defaultdict
from pprint import pprint
import datetime
from operator import itemgetter

punctuation = set(punctuation)
punctuation.add("’")
pd.options.display.float_format = '{:20,.2f}'.format

In [2]:
# Import Patagonia followers

columns = ["screen_name", "user_id", "name", "location", "followers_count", "friends_count", "description"]
pata = pd.read_csv('patagonia_followers.txt', names=columns, sep='\t', lineterminator='\n',
                   dtype={"screen_name": "str",
                          "user_id": "str",
                          "name": "str",
                          "location": "str",
                          "followers_count": "str",
                          "friends_count": "str",
                          "description": "str"})

new_header = pata.iloc[0] #grab the first row for the header
pata = pata[1:] #take the data less the header row
pata.columns = new_header

In [3]:
# Import Boohoo followers

columns = ["screen_name", "user_id", "name", "location", "followers_count", "friends_count", "description"]
boohoo = pd.read_csv('boohoo_followers.txt', names=columns, sep='\t', lineterminator='\n',
                   dtype={"screen_name": "str",
                          "user_id": "str",
                          "name": "str",
                          "location": "str",
                          "followers_count": "str",
                          "friends_count": "str",
                          "description": "str"})

new_header = boohoo.iloc[0] #grab the first row for the header
boohoo = boohoo[1:] #take the data less the header row
boohoo.columns = new_header

In [4]:
# check that data is imported correctly
print(pata.shape)
pata.iloc[:3]  

(529263, 7)


| Unnamed: 0 | screen_name | user_id | name | location | followers_count | friends_count | description |
|---|---|---|---|---|---|---|---|
| 1 | AustinH1155 | 1453579361404215296 | Austin Hughes | | 1 | 58 | |
| 2 | jcatgn | 915658561140740097 | Jesus | Catalunya | 354 | 828 | |
| 3 | chelseymarie542 | 1453575448840056832 | Chelsey Derrick | | 0 | 20 | |


In [5]:
# check that data is imported correctly
print(boohoo.shape)
boohoo.iloc[:3]  

(550450, 7)


| Unnamed: 0 | screen_name | user_id | name | location | followers_count | friends_count | description |
|---|---|---|---|---|---|---|---|
| 1 | JoanneH14660255 | 4821469259 | Joanne Hayes | | 7 | 31 | |
| 2 | TrishKe59123459 | 1453612890263629826 | Trish Kelly | | 1 | 46 | |
| 3 | Katelyn_w123 | 1453605763004391424 | Katelyn | | 0 | 23 | 04/09/21💍👰🏻‍♀️🤵‍♂️ |


### Function for EDA on an organization's followers

In [6]:
def org_eda(org, num_words=10) : 
    
    # Convert to correct data types
    org = org.convert_dtypes(infer_objects=True, convert_string=True,
      convert_integer=True, convert_boolean=True, convert_floating=True)

    org = org.astype({"followers_count": "Int64", "friends_count": "Int64"}, errors='ignore')
    # org.dtypes  # check converted types, if needed
    
    # Create vars for desc stats
    Users = len(org)
    Followers = org["followers_count"].describe()
    Friends = org["friends_count"].describe()
    
    NAs = org.isna().sum()
    responses = org.notna().sum()
    
    # Nested function for creating desc stats for the two columns with string data: Location and Description
    def get_patterns(column)  :
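        """
        Take a pandas Series of text (e.g. location or description), drop
        missing values, split each entry on whitespace, keep lowercase
        alphabetic non-stopword tokens, and return a dictionary of
        summary statistics.
        """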
        # Tokenization / normalization
        temprmna = column.dropna()
        clean = []

        # loop through each non-null entry and append its cleaned tokens to the word list
        for line in temprmna:
            for w in line.split() :
                if w.isalpha() and w.lower() not in sw :  # remember that sw list is all lowercase
                    clean.append(w.lower())

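        # Calculate summary statistics (guarding against an empty token list)
        total_tokens = len(clean)
        unique_tokens = len(set(clean))
        clean_tok_len = [len(w) for w in clean]
        avg_token_len = np.mean(clean_tok_len) if clean else 0.0
        lex_diversity = unique_tokens / total_tokens if total_tokens else 0.0
        # Use org_eda's num_words argument (default 10) instead of a hard-coded 10
        top_10 = Counter(clean).most_common(num_words)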


        # Now we'll fill out the dictionary. 
        stats = {'tokens':total_tokens,
                   'unique_tokens':unique_tokens,
                   'avg_token_length':avg_token_len,
                   'lexical_diversity':lex_diversity,
                   'top_10':top_10}

        return(stats)
    
    # Apply the nested function, create variables to place into final dictionary
    location = org["location"]
    description = org["description"]
    
    location_stats = get_patterns(location)
    description_stats = get_patterns(description)
    
    # Create final dictionary to hold results
    results = {'Desc Stats': {'Users': Users,
                              'Followers': Followers,
                             'Friends': Friends,
                              'Responses by Column': responses,
                             'NAs by Column': NAs},
               'Location Text Stats': location_stats,
               'Description Text Stats': description_stats
              }
    return(results)
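
Because `get_patterns` is nested inside `org_eda`, it is hard to try on its own. The block below is a rough standalone sketch of the same token statistics using only the standard library. `token_stats` and `TOY_STOPWORDS` are hypothetical names, the stopword set is a tiny stand-in for the NLTK list used above, and the descriptions are invented examples rather than real follower data.

```python
from collections import Counter

# Tiny stand-in stopword set (an assumption); the notebook uses NLTK's English list.
TOY_STOPWORDS = {"i", "the", "and", "a", "of", "in"}


def token_stats(texts, stopwords=TOY_STOPWORDS, num_words=3):
    """Compute token statistics similar to get_patterns for an iterable of strings."""
    # Skip missing/empty entries, split on whitespace, keep alphabetic non-stopwords
    tokens = [
        w.lower()
        for text in texts
        if text
        for w in text.split()
        if w.isalpha() and w.lower() not in stopwords
    ]
    total = len(tokens)
    unique = len(set(tokens))
    return {
        "tokens": total,
        "unique_tokens": unique,
        "avg_token_length": sum(len(w) for w in tokens) / total if total else 0.0,
        "lexical_diversity": unique / total if total else 0.0,
        "top_words": Counter(tokens).most_common(num_words),
    }


# Invented descriptions for illustration only.
toy_descriptions = [
    "I love the outdoors and climbing",
    None,
    "Climbing in the Alps",
    "fashion & beauty lover",
]
stats = token_stats(toy_descriptions)
print(stats)
```

Lexical diversity here is the number of unique tokens divided by the total number of tokens, so a value near 1 means few words repeat, while a value near 0 means the same words recur often.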

In [7]:
print(f' Patagonia stats are:')
print("\n")
pprint(org_eda(pata))

# Ignore the "count" stat under both the Followers and Friends keys. It is just the number of
# non-null rows in that column, i.e. the number of followers pulled for the organization.

 Patagonia stats are:


{'Desc Stats': {'Followers': count             529,262.00
mean                1,102.74
std                50,563.45
min                     0.00
25%                    23.00
50%                    98.00
75%                   314.00
max            28,880,858.00
Name: followers_count, dtype: float64,
                'Friends': count             529,262.00
mean                  944.29
std                 8,639.05
min                     0.00
25%                   162.00
50%                   413.00
75%                 1,000.00
max             4,181,260.00
Name: friends_count, dtype: float64,
                'NAs by Column': screen_name             1
user_id                 0
name                   56
location           203082
followers_count         1
friends_count           1
description        178883
dtype: int64,
                'Responses by Column': screen_name        529262
user_id            529263
name               529207
location           326181
follower

In [8]:
print(f' Boohoo stats are:')
print("\n")
pprint(org_eda(boohoo))

# Ignore the "count" stat under both the Followers and Friends keys. It is just the number of
# non-null rows in that column, i.e. the number of followers pulled for the organization.

 Boohoo stats are:


{'Desc Stats': {'Followers': count             550,450.00
mean                  732.38
std                40,711.55
min                     0.00
25%                    25.00
50%                    95.00
75%                   261.00
max            19,365,469.00
Name: followers_count, dtype: float64,
                'Friends': count             550,450.00
mean                  759.56
std                 8,551.39
min                     0.00
25%                   156.00
50%                   367.00
75%                   798.00
max             4,181,217.00
Name: friends_count, dtype: float64,
                'NAs by Column': screen_name             0
user_id                 0
name                  105
location           239692
followers_count         0
friends_count           0
description        229015
dtype: int64,
                'Responses by Column': screen_name        550450
user_id            550450
name               550345
location           310758
followers_c

### Future work

This preliminary EDA points to a number of future directions for analysis.
1) **Plot locations** <br>
Patagonia's followers are largely US-based; Boohoo's are primarily UK-based. From an audience-targeting perspective, it could be interesting to plot their actual locations.

2) **Compare followers-of-followers or friends-of-followers** <br>
You could use these lists of followers to pull information on *their* followers or friends, and compare differences in location, word usage, lexical diversity, etc. 

3) **Run lexical expansion analysis on Description text** <br>
A lexical expansion analysis on the description text could be useful for creating a model to predict, based on their Twitter description, whether a person is more likely to follow Patagonia or Boohoo.
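
As a possible first step toward the description-based comparison in item 3, one could rank words by how strongly they lean toward one audience. The block below is a sketch under stated assumptions and is not necessarily the lexical expansion method intended above: it uses a simple add-one-smoothed log ratio of relative word frequencies, `distinctive_words` is a hypothetical helper, and the token lists are invented.

```python
import math
from collections import Counter


def distinctive_words(tokens_a, tokens_b, top_n=2):
    """Rank words by the log ratio of their smoothed relative frequencies (A vs. B)."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    vocab = set(counts_a) | set(counts_b)
    # Add-one smoothing so that words missing from one side do not cause log(0)
    total_a = sum(counts_a.values()) + len(vocab)
    total_b = sum(counts_b.values()) + len(vocab)
    scores = {
        w: math.log((counts_a[w] + 1) / total_a) - math.log((counts_b[w] + 1) / total_b)
        for w in vocab
    }
    # Sort by score (highest first), breaking ties alphabetically for stable output
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    leaning_a = ranked[:top_n]
    leaning_b = ranked[-top_n:][::-1]  # lowest scores first
    return leaning_a, leaning_b


# Invented tokens for illustration only; real input would be cleaned description tokens.
toy_tokens_a = ["climbing", "outdoors", "climbing", "coffee"]
toy_tokens_b = ["fashion", "beauty", "fashion", "coffee"]
leaning_a, leaning_b = distinctive_words(toy_tokens_a, toy_tokens_b)
print("Leans toward A:", leaning_a)
print("Leans toward B:", leaning_b)
```

Words with scores near zero (such as a word used equally by both audiences) carry little signal, so the extremes of this ranking are the natural candidates to feed into any later predictive model.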