# Pharma NLP tweets analysis


Having found the Harvard Business Review's ranking of the most skilled companies on Twitter, ["50 Companies That Get Twitter – and 50 That Don’t"](https://hbr.org/2015/04/the-best-and-worst-corporate-tweeters), I decided to carry out a small analysis myself.

The HBR analysis was conducted on 350,00 tweets from 300 companies listed on NASDAQ, NYSE or FTSE, and it presents an "empathy" score, with the most empathetic companies at the top. The author explains that the methodology assumes empathy consists of "reassurance, authenticity, and emotional connection", but it is difficult to actually measure that in real life. Engagement, on the other hand, can be measured.


In the ranking, AstraZeneca took last place. Why is that? Do we differ that much from other pharmaceutical companies? Let's check!

Code can be found on my GitHub account: [mbalcerzak](https://github.com/mbalcerzak/twitter_pharma)

## Do pharmaceutical companies need Twitter?
##### It's important to engage with patients on Twitter

The new generation get their news from social media. Twitter is a way to communicate with and educate patients, solve problems and inform. That's why it's so important to engage with followers in an authentic and empathetic way.

According to the article ["It’s Time to Tweet—How Pharma Should Be Using Twitter"](https://www.pm360online.com/its-time-to-tweet-how-pharma-should-be-using-twitter/) there are numerous benefits to engaging with patients and investors on social media:
 - top 10 biggest pharmaceutical companies already use Twitter
 - accessible and timely information for patients and regulators
 - interacting with opinion leaders and "pharmaceutical influencers"
 - increased company reputation
 - better customer service
 - advertising opportunity (also for future hiring)


##### Good text source for NLP analysis

Personally, I wanted to learn web scraping and Natural Language Processing, and Twitter provides excellent starting tools and is a great data mining playground. It encourages open discussion and is frequently used by both companies and consumers. Most businesses use Twitter, and pharmaceutical companies are no exception.

### Scraping data from Twitter

Tweets from the chosen companies were collected using the [**tweepy**](https://www.tweepy.org/) package for Python 3.x. All of the accounts are officially verified:
- [AstraZeneca](https://twitter.com/AstraZeneca) 
- [Johnson & Johnson](https://twitter.com/JNJCares)
- [Roche](https://twitter.com/Roche)
- [Pfizer](https://twitter.com/Pfizer) 
- [Novartis](https://twitter.com/Novartis)
- [BayerPharma](https://twitter.com/BayerPharma) 
- [Merck](https://twitter.com/Merck) 
- [GSK](https://twitter.com/GSK) 
- [Sanofi](https://twitter.com/Sanofi)
- [AbbVie](https://twitter.com/abbvie)
- [Abbott](https://twitter.com/AbbottGlobal) 
- [Eli Lilly and Company](https://twitter.com/LillyPad) 
- [Amgen](https://twitter.com/Amgen) 
- [Bristol-Myers Squibb](https://twitter.com/bmsnews) 
- [GileadSciences](https://twitter.com/GileadSciences) 

My code is fully available on GitHub: [mbalcerzak/../webscraping/](https://github.com/mbalcerzak/twitter_pharma/tree/master/web_scraping)
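
The scraping step boils down to turning each tweet object returned by tweepy into one row of a pipe-separated text file. Below is a minimal sketch of that conversion, not the code from the repository. The helper names `status_to_row` and `write_tweets` are hypothetical, and the column order is an assumption based on the columns that `prepare_dataset` reads. The attribute names follow tweepy's `Status` model, and a stand-in object is used so the example runs without API credentials (in a real run the statuses would come from a call such as `API.user_timeline`).

```python
import csv
import io
from types import SimpleNamespace

# Assumed column layout of the '<company>_tweets.txt' files
COLUMNS = ['id', 'text', 'retweet', 'source', 'fav', 'RT', 'created_at']


def status_to_row(status):
    """Map one tweepy-style Status object to a row (hypothetical helper)."""
    return [
        status.id,
        # pipes and newlines in the text would break the '|'-separated format
        status.text.replace('|', ' ').replace('\n', ' '),
        # retweets carry a 'retweeted_status' attribute
        hasattr(status, 'retweeted_status'),
        status.source,
        status.favorite_count,
        status.retweet_count,
        status.created_at,
    ]


def write_tweets(statuses, handle):
    """Write a header and one row per status to an open text handle."""
    writer = csv.writer(handle, delimiter='|', lineterminator='\n')
    writer.writerow(COLUMNS)
    for status in statuses:
        writer.writerow(status_to_row(status))


# Stand-in for a tweepy Status, for illustration only
fake_status = SimpleNamespace(
    id=1,
    text='Hello | #world\nagain',
    source='Twitter Web App',
    favorite_count=5,
    retweet_count=2,
    created_at='2019-01-01 10:00:00',
)

buffer = io.StringIO()
write_tweets([fake_status], buffer)
print(buffer.getvalue())
```

A file written this way can be read back with `pd.read_csv(..., sep='|')`, which is what `prepare_dataset` does.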

In [2]:
# packages used

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import collections
import re
import os

from nltk.stem import WordNetLemmatizer
from scipy import stats
from nltk.corpus import stopwords

In [9]:
path = 'C:/Users/malgo_000/Desktop/Web_scraping/twitter_scraping/tweet_texts_pharma/'
#path = os.path.join(os.getcwd(), 'tweet_texts_pharma/')

def prepare_dataset(company):
    df = pd.read_csv(path + '%s_tweets.txt' % company, sep='|')
    
    df['company'] = company
    df['id'] = df['id'].apply(str)
    
    df['hashtags'] = df['text'].apply(lambda s: re.findall(r'#(\w+)', s))
    df['num_hash'] = df['hashtags'].apply(len)
    
    df['tagged'] = df['text'].apply(lambda s: re.findall(r'@(\w+)', s))
    
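    def clean_tweet(tweet):
        # Strip @mentions, every non-alphanumeric character and URLs,
        # then collapse whitespace and drop the 'RT ' retweet marker
        check = r'(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)'
        return ' '.join(re.sub(check, ' ', tweet).split()).replace('RT ','')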
        
    df['clean_tweet'] = [clean_tweet(tweet) for tweet in df['text']]    
    df['len'] = df['clean_tweet'].apply(len)
    
    df['datetime'] = pd.to_datetime(df['created_at'])
    df['hour'] = df['datetime'].dt.hour
    df['month'] = df['datetime'].dt.month
    df['day'] = df['datetime'].dt.day
    df['year'] = df['datetime'].dt.year
    df = df.drop(columns=['created_at'])
    
    return df

df = prepare_dataset('AstraZeneca')
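
To see what the regular expressions in `prepare_dataset` actually do, here is a small self-contained run on a made-up tweet. The sample text is invented for illustration; the patterns are the ones used above.

```python
import re


def clean_tweet(tweet):
    # Same pattern as in prepare_dataset: mentions, non-alphanumerics, URLs
    check = r'(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)'
    return ' '.join(re.sub(check, ' ', tweet).split()).replace('RT ', '')


# Invented example tweet
sample = 'RT @WHO: Proud to support #WorldHealthDay! Read more: https://t.co/abc123 #health'

hashtags = re.findall(r'#(\w+)', sample)
tagged = re.findall(r'@(\w+)', sample)
cleaned = clean_tweet(sample)

print(hashtags)  # ['WorldHealthDay', 'health']
print(tagged)    # ['WHO']
print(cleaned)   # Proud to support WorldHealthDay Read more health
```

One thing the sample does not show: the mention pattern `@[A-Za-z0-9]+` stops at an underscore, so for a handle such as `@some_user` the part after the underscore stays in the cleaned text.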

### Exploratory Data Analysis

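Columns of the prepared dataset:

**id** - tweet identifier, stored as a string  
**text** - raw tweet text  
**retweet** - flag marking whether the tweet is a retweet  
**source** - source field saved by the scraper  
**fav** - number of likes  
**RT** - number of retweets  
**hashtags** - list of hashtags found in the text  
**company** - account the tweets were scraped from  
**num_hash** - number of hashtags in the tweet  
**tagged** - list of accounts mentioned with `@`  
**clean_tweet** - text with mentions, links and punctuation removed  
**len** - length of `clean_tweet` in characters  
**datetime** - time the tweet was created  
**hour** - hour of posting  
**month** - month of posting  
**day** - day of the month of posting  
**year** - year of posting  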



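Tweets are cleaned with the helper defined inside `prepare_dataset`, which strips mentions, punctuation and links:

```python
def clean_tweet(tweet):
    check = r'(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)'
    return ' '.join(re.sub(check, ' ', tweet).split()).replace('RT ','')
```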

In [16]:
print('\n'.join(['**'+x+'** -  ' for x in list(df)]))



In [None]:
    
# getting rid of outliers: tweets whose number of likes lies 3 or more standard deviations from the mean
df['z'] = np.abs(stats.zscore(df['fav']))
print('Tweets that can be classified as outliers when it comes to the number of likes: ')
display(df[df['z'] >= 3])
df = df[df['z'] < 3]
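
The z-score of a value is its distance from the mean expressed in standard deviations, and `scipy.stats.zscore` uses the population standard deviation by default. A minimal plain-Python sketch of the same filter follows. The like counts are invented and the helper name `zscores` is hypothetical.

```python
from statistics import fmean, pstdev


def zscores(values):
    """Return (x - mean) / std for each value, using the population std."""
    mean = fmean(values)
    std = pstdev(values)  # population standard deviation (ddof=0)
    return [(v - mean) / std for v in values]


# Invented like counts: 19 ordinary tweets and one viral tweet
likes = [10, 12, 9, 11, 13, 10, 12, 11, 9, 10,
         12, 14, 8, 10, 11, 12, 9, 13, 10, 500]

z = zscores(likes)
outliers = [v for v, score in zip(likes, z) if abs(score) >= 3]
kept = [v for v, score in zip(likes, z) if abs(score) < 3]

print(outliers)   # [500]
print(len(kept))  # 19
```

Like counts are usually heavily skewed, so a fixed threshold of 3 is a rough rule of thumb rather than a precise test.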

In [51]:
print('Average length of AstraZeneca tweets: {}'.format(round(np.mean(df['len']))))
print('Number of likes for the most liked tweet: {}'.format(np.max(df['fav'])))
print('An average tweet received {} likes'.format(round(np.mean(df['fav']))))
print('Original tweets got on average {} likes and retweeted tweets {}'.format(round(np.mean(df['fav'][df['retweet']==False])),
                                                                            round(np.mean(df['fav'][df['retweet']==True]))))

Average length of AstraZeneca tweets: 142
Number of likes for the most liked tweet: 1213
An average tweet received 16 likes
Original tweets got on average 12 likes and retweeted tweets 31


In [57]:
print('A tweet that received the most likes:')
display(df[['clean_tweet','hashtags','fav','RT']][df['fav']==max(df['fav'])])

print('Most liked non-retweeted tweet:')
df_org = df[df['retweet']==False]
display(df_org[['clean_tweet','hashtags','fav','RT']][df_org['fav']==max(df_org['fav'])])

A tweet that received the most likes:

| | clean_tweet | hashtags | fav | RT |
|---|---|---|---|---|
| 956 | Happy LGBTSTEMDay one and all Let s work together for more inclusivity and more support for LGBTQ people in STEM | [LGBTSTEMDay] | 1213 | 790 |

Most liked non-retweeted tweet:

| | clean_tweet | hashtags | fav | RT |
|---|---|---|---|---|
| 845 | We are named a global sustainability leader by the 2018 Dow Jones Sustainability Index We received industry top scores in the areas Environmental Reporting Labour Practice Indicators and Health Outcome Contribution sustainability DJSI ESG | [sustainability, DJSI, ESG] | 325 | 16 |


### Text normalisation

- converting text into lower-case words
- removing hashtags, links and stopwords
- lemmatisation ([Wikipedia explanation](https://en.wikipedia.org/wiki/Lemmatisation))
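
The steps listed above can be sketched in a few lines of plain Python. This is a toy version under stated assumptions. The stop-word set is a tiny hand-written stand-in for NLTK's English list, and `toy_lemmatise` is a hypothetical lookup standing in for a real lemmatizer such as NLTK's `WordNetLemmatizer`. Like `clean_tweet`, the sketch drops only the `#` sign and keeps the hashtag word.

```python
import re

# Tiny illustrative stop-word set (assumption; the notebook uses NLTK's list)
STOP_WORDS = {'a', 'an', 'and', 'are', 'for', 'in', 'of', 'the', 'to', 'we', 'with'}

# Toy lookup standing in for a real lemmatizer
TOY_LEMMAS = {'patients': 'patient', 'medicines': 'medicine'}


def toy_lemmatise(word):
    """Return the dictionary form if the toy lookup knows it, else the word."""
    return TOY_LEMMAS.get(word, word)


def normalise(text, stop_words=STOP_WORDS, lemmatise=toy_lemmatise):
    """Lower-case, strip links and symbols, drop stop words, lemmatise."""
    text = text.lower()
    text = re.sub(r'\w+://\S+', ' ', text)     # remove links
    text = re.sub(r'[^0-9a-z \t]', ' ', text)  # remove '#', punctuation, etc.
    return [lemmatise(word) for word in text.split() if word not in stop_words]


tokens = normalise('We are helping #patients with new medicines https://t.co/xyz')
print(tokens)  # ['helping', 'patient', 'new', 'medicine']
```

Passing the lemmatizer in as a function keeps the pipeline the same when the toy lookup is swapped for a real one.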

In [58]:
# Merging all datasets with company tweets into one dataset

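def combine_tweets(company_names):
    # Start from the AstraZeneca frame prepared above and add the other companies.
    # DataFrame.append was removed in pandas 2.0, so pd.concat is used instead.
    frames = [df] + [prepare_dataset(company) for company in company_names]
    return pd.concat(frames, ignore_index=True)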

df_all = combine_tweets(['JNJCares', 'Roche', 'Pfizer','Novartis', 
                         'BayerPharma', 'Merck','GSK','Sanofi', 'abbvie', 
                         'AbbottGlobal','LillyPad', 'Amgen', 'bmsnews',
                         'GileadSciences'])

In [None]:


# Set of frequently used words (stop words) to remove from the tweets later
stop_words = set(stopwords.words('english'))

#print(sorted(stop_words))



## Time series
#time_fav = pd.Series(data=df['fav'].values, index=df['created_at'])
#time_fav.plot(figsize=(16, 4), color = 'r', label = 'favourites', legend = True)                
#
## for retweets
#time_rt = pd.Series(data=df['RT'].values, index=df['created_at'])
#time_rt.plot(figsize=(16, 4), color = 'b', label = 'retweets', legend = True)
#
## number of hashtags
#time_rt = pd.Series(data=df['num_hash'].values, index=df['created_at'])
#time_rt.plot(figsize=(16, 4), color = 'g', label = 'hashtags', legend = True)
#
#plt.show()   

# histogram of the number of hashtags per tweet
 
df['num_hash'].hist(color = 'b', label = 'number of hashtags')
plt.show()  

counter_hsh = collections.Counter(df['num_hash'])
print(counter_hsh.most_common()) 

# lower-case, drop stop words and lemmatise

lemmatizer = WordNetLemmatizer() 

# collect the words of all tweets:
all_tweets = []
all_tweets_lem = []

for tweet in df['clean_tweet']:
    for word in tweet.lower().split():  # split() skips the empty strings of empty tweets
        if word not in stop_words:
            all_tweets.append(word)
            all_tweets_lem.append(lemmatizer.lemmatize(word))
            
# most common words
counter = collections.Counter(all_tweets)
print(counter.most_common(15))         

# most common lemmatized words
counter_l = collections.Counter(all_tweets_lem)
print(counter_l.most_common(15))

# most liked AstraZeneca tweets
display(df.nlargest(3, 'fav'))
# most liked tweets across all companies
display(df_all.nlargest(15, 'fav')['clean_tweet'])

d = pd.DataFrame(counter.most_common(15), columns = ['Word', 'Count'])
d.plot.bar(x='Word',y='Count')

# word cloud (needs: from wordcloud import WordCloud)
#plt.figure(figsize = (30,30))
#wordcloud_ = WordCloud(
#                      background_color = 'white',
#                      max_words = 1000,
#                      max_font_size = 120,
#                      width=600, height=400,
#                      random_state = 42
#                    ).generate(' '.join([a for a in all_tweets]))
#
##Plotting the word cloud
#plt.imshow(wordcloud_)
#plt.axis('off')
#plt.show()


# most common hashtags
hsh_list= []
for h in list(df['hashtags']):
    hsh_list += h 
      
counter_h = collections.Counter(hsh_list)
print(counter_h.most_common(15)) 

# wordcloud of hashtags
#plt.figure(figsize = (30,30))
#wordcloud_ = WordCloud(
#                      background_color = 'white',
#                      max_words = 1000,
#                      max_font_size = 120,
#                      width=800 ,height=400,
#                      random_state = 42,
#                      collocations=False,
#                    ).generate(' '.join([a for a in hsh_list]))
#
#plt.imshow(wordcloud_)
#plt.axis('off')
#plt.show()

# most popular tagged accounts
tag_list= []
for t in list(df['tagged']):
    tag_list += t 
      
counter_t = collections.Counter(tag_list)
print(counter_t.most_common(15))   


# number of retweets

#ax1 = sns.countplot(df['retweet'], palette='rainbow')
##ax1.set_title("%s's tweets" % company)
#ax1.set(xticklabels=['Tweets','Retweets'])

#Number of tweets hourly
#hourly_tweets = df['hour'].size().unstack()
#hourly_tweets.plot(title='Hourly Tweet Counts', colormap='coolwarm')

hourly_tweets = df_all.groupby(['hour', 'company']).size().unstack()
hourly_tweets.plot(title='Hourly Tweet Counts', stacked = True, colormap='coolwarm')

#Number of tweets by the months
monthly_tweets = df_all.groupby(['month', 'company']).size().unstack()
monthly_tweets.plot(title='Monthly Tweet Counts', colormap='winter')

# scatterplot of likes vs hour of posting

ax = sns.scatterplot(x="hour", y="fav", hue="company", data=df_all, palette="Purples")


### Clustering (themes)

### Sector averages

- how often each company tweets per period
- the 3 most popular posts of this year, and which company posted them
- tweets with / without images
- best time to tweet, according to an [HBR article](https://hbr.org/2018/09/a-study-shows-the-best-times-of-day-to-post-to-social-media):

  "Human working memory exhibits inherent variation across time of day and is highest when we wake up in the morning, lowest in mid-afternoon, and moderate in the evening. Higher availability of working memory makes individuals alert and feel the need to seek information. This means that consumers’ desire to engage with content will likely be highest in the morning, lowest in the afternoon, and moderate in the evening."

  "Assuming the majority of the audience start their day in the morning, it is ideal to post content conveying high-arousal emotion (i.e., angry or worried) in the morning and “deep think” content in the afternoon"

- tweet length
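
One way to approach the "best time to tweet" question is to average the likes per posting hour and pick the hour with the highest mean. The sketch below does this in plain Python on invented `(hour, likes)` pairs. The helper name `mean_likes_by_hour` is hypothetical, and on the real data the same aggregation could be written as `df_all.groupby('hour')['fav'].mean()`.

```python
from collections import defaultdict
from statistics import fmean


def mean_likes_by_hour(tweets):
    """Average the like counts of (hour, likes) pairs for each posting hour."""
    likes_per_hour = defaultdict(list)
    for hour, likes in tweets:
        likes_per_hour[hour].append(likes)
    return {hour: fmean(values) for hour, values in sorted(likes_per_hour.items())}


# Invented sample: (hour of posting, number of likes)
sample = [(8, 20), (8, 30), (14, 5), (14, 7), (19, 12)]

averages = mean_likes_by_hour(sample)
best_hour = max(averages, key=averages.get)

print(averages)   # {8: 25.0, 14: 6.0, 19: 12.0}
print(best_hour)  # 8
```

A mean is sensitive to a single viral tweet, so it makes sense to compute it after the outlier removal step, or to compare it with the median.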





### References:
1. ["50 Companies That Get Twitter – and 50 That Don’t"](https://hbr.org/2015/04/the-best-and-worst-corporate-tweeters)
2. [List of largest pharmaceutical companies by revenue](https://en.wikipedia.org/wiki/List_of_largest_pharmaceutical_companies_by_revenue)
3. [Twitter Dev Documentation](https://developer.twitter.com/en/docs)
4. ["It’s Time to Tweet—How Pharma Should Be Using Twitter"](https://www.pm360online.com/its-time-to-tweet-how-pharma-should-be-using-twitter/)