# Text Analysis Using Twitter

Twitter has become one of the most popular sources of text to analyze. Why?
- Huge user base.
- Updated in real time.
- Easy-to-use API.

In this example I will discuss the difference between __extracting__ and __streaming__ Twitter data.

How do you get access to the API?
- You need a Twitter account.
- You need to associate a phone number with it.
- You then need to apply for a developer account.

Twitter will then provide you with four keys: an API key and secret, plus an access token and secret. The package __tweepy__ is a handy way to use the API from Python.

In [2]:
import tweepy
import json
import pandas as pd

## Extracting Tweets
When do you want to do this? Usually when you want tweets from a specific user.

First step: create an API instance with the package __tweepy__, using the four keys.

In [3]:
auth = tweepy.OAuthHandler(key, key_secret)
auth.set_access_token(access_key, access_secret)

api = tweepy.API(auth)
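The names `key`, `key_secret`, `access_key`, and `access_secret` are not defined anywhere in this notebook. One way to supply them without pasting secrets into a cell is to read them from environment variables. This is only a sketch: the variable names such as `TWITTER_API_KEY` are my own assumption, not something Twitter or tweepy requires.

```python
import os

# Hypothetical environment variable names; use whatever names you exported.
ENV_NAMES = {
    "key": "TWITTER_API_KEY",
    "key_secret": "TWITTER_API_SECRET",
    "access_key": "TWITTER_ACCESS_TOKEN",
    "access_secret": "TWITTER_ACCESS_SECRET",
}


def load_credentials(env=None):
    """Return the four credentials as a dict, or raise if any is missing."""
    env = os.environ if env is None else env
    missing = [name for name in ENV_NAMES.values() if not env.get(name)]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return {short: env[name] for short, name in ENV_NAMES.items()}
```

With the variables set, `creds = load_credentials()` gives values that can be passed along as `tweepy.OAuthHandler(creds["key"], creds["key_secret"])`.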

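Our API object has a lot of built-in methods. For example, "home_timeline" returns the 20 most recent tweets from my account's home timeline. Here I take the sixth one (index 5) and convert it to a dictionary with "._json":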

In [16]:
tweet = api.home_timeline(tweet_mode = 'extended')[5]._json

Here is what that tweet looks like (the output is cut off):

In [17]:
tweet

{'created_at': 'Thu Feb 25 22:54:27 +0000 2021',
 'id': 1365072697705652224,
 'id_str': '1365072697705652224',
 'full_text': 'RT @BeaconPressBks: #JamesBaldwin’s documentation of his own troubled times in NOTHING PERSONAL cuts to the core of where we find ourselves…',
 'truncated': False,
 'display_text_range': [0, 140],
 'entities': {'hashtags': [{'text': 'JamesBaldwin', 'indices': [20, 33]}],
  'symbols': [],
  'user_mentions': [{'screen_name': 'BeaconPressBks',
    'name': 'Beacon Press',
    'id': 18031870,
    'id_str': '18031870',
    'indices': [3, 18]}],
  'urls': []},
 'source': '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
 'in_reply_to_status_id': None,
 'in_reply_to_status_id_str': None,
 'in_reply_to_user_id': None,
 'in_reply_to_user_id_str': None,
 'in_reply_to_screen_name': None,
 'user': {'id': 1100989292543860736,
  'id_str': '1100989292543860736',
  'name': 'FutureDrChamblee✊🏾🇵🇭 LV ➡️Memphis',
  'screen_name': 'HealthCommGuy',


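A single tweet can also be looked up by its ID with "statuses_lookup". Note that this method expects a list of IDs, so passing a bare integer raises an error:

In [11]:
tweet = api.statuses_lookup(1364657276431179777)
tweet

TypeError: 'int' object is not iterable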

The tweet from the home timeline was already converted to a dictionary with "._json", so we can inspect its keys directly:

In [21]:
tweet_json = tweet
tweet_json.keys()

dict_keys(['created_at', 'id', 'id_str', 'full_text', 'truncated', 'display_text_range', 'entities', 'source', 'in_reply_to_status_id', 'in_reply_to_status_id_str', 'in_reply_to_user_id', 'in_reply_to_user_id_str', 'in_reply_to_screen_name', 'user', 'geo', 'coordinates', 'place', 'contributors', 'retweeted_status', 'is_quote_status', 'retweet_count', 'favorite_count', 'favorited', 'retweeted', 'lang'])

Some info we can get:

In [22]:
tweet_json['created_at'], tweet_json['full_text']

('Thu Feb 25 22:54:27 +0000 2021',
 'RT @BeaconPressBks: #JamesBaldwin’s documentation of his own troubled times in NOTHING PERSONAL cuts to the core of where we find ourselves…')
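The `created_at` value is a plain string, which is awkward to sort or filter on. A minimal sketch of turning it into a timezone-aware `datetime`, assuming the format seen in the output above and an English locale for the day and month abbreviations (`parse_created_at` is a name I made up for illustration):

```python
from datetime import datetime

# Format of strings like 'Thu Feb 25 22:54:27 +0000 2021'
TWITTER_TIME_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def parse_created_at(created_at):
    """Convert a created_at string into a timezone-aware datetime."""
    return datetime.strptime(created_at, TWITTER_TIME_FORMAT)


posted = parse_created_at("Thu Feb 25 22:54:27 +0000 2021")
print(posted.isoformat())  # 2021-02-25T22:54:27+00:00
```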

But who is this?

In [23]:
tweet_json['user']

{'id': 1100989292543860736,
 'id_str': '1100989292543860736',
 'name': 'FutureDrChamblee✊🏾🇵🇭 LV ➡️Memphis',
 'screen_name': 'HealthCommGuy',
 'location': 'Memphis, TN  he/him/his',
 'description': '2X UNLV Grad Comm Studies, 1st year PhD Student at UofMemphis. Studying Race, media, pop culture, & health. CancerFighter, DogDad, H2O,Coffee,Beer, &Whiskey.',
 'url': 'https://t.co/gzXLXpxoDl',
 'entities': {'url': {'urls': [{'url': 'https://t.co/gzXLXpxoDl',
     'expanded_url': 'http://curtischamblee.com',
     'display_url': 'curtischamblee.com',
     'indices': [0, 23]}]},
  'description': {'urls': []}},
 'protected': False,
 'followers_count': 930,
 'friends_count': 4131,
 'listed_count': 2,
 'created_at': 'Thu Feb 28 05:21:17 +0000 2019',
 'favourites_count': 7398,
 'utc_offset': None,
 'time_zone': None,
 'geo_enabled': False,
 'verified': False,
 'statuses_count': 2254,
 'lang': None,
 'contributors_enabled': False,
 'is_translator': False,
 'is_translation_enabled': False,
 'profil

Now we have a dictionary of information about the user, including:

In [24]:
tweet_json['user']['name'],\
tweet_json['user']['description'],\
tweet_json['user']['followers_count']

('FutureDrChamblee✊🏾🇵🇭 LV ➡️Memphis',
 '2X UNLV Grad Comm Studies, 1st year PhD Student at UofMemphis. Studying Race, media, pop culture, & health. CancerFighter, DogDad, H2O,Coffee,Beer, &Whiskey.',
 930)

The text starts with "RT", so this is a retweet. But whose tweet was he retweeting?

In [26]:
tweet_json['retweeted_status']

{'created_at': 'Thu Feb 25 17:51:02 +0000 2021',
 'id': 1364996340976025602,
 'id_str': '1364996340976025602',
 'full_text': '#JamesBaldwin’s documentation of his own troubled times in NOTHING PERSONAL cuts to the core of where we find ourselves today. With a foreword by @imaniperry &amp; an afterword by @esglaude, keep your eye out for it this May! 👀 #BlackHistoryMonth  https://t.co/mP0aYown0o https://t.co/SHcL0577y8',
 'truncated': False,
 'display_text_range': [0, 271],
 'entities': {'hashtags': [{'text': 'JamesBaldwin', 'indices': [0, 13]},
   {'text': 'BlackHistoryMonth', 'indices': [228, 246]}],
  'symbols': [],
  'user_mentions': [{'screen_name': 'imaniperry',
    'name': 'Imani Perry',
    'id': 23410776,
    'id_str': '23410776',
    'indices': [146, 157]},
   {'screen_name': 'esglaude',
    'name': 'Eddie S. Glaude Jr.',
    'id': 180535644,
    'id_str': '180535644',
    'indices': [180, 189]}],
  'urls': [{'url': 'https://t.co/mP0aYown0o',
    'expanded_url': 'https://buff.

In [27]:
quoted_tweet = tweet_json['retweeted_status']

In [28]:
quoted_tweet['user']['name'],\
quoted_tweet['user']['screen_name'],\
quoted_tweet['full_text']

('Beacon Press',
 'BeaconPressBks',
 '#JamesBaldwin’s documentation of his own troubled times in NOTHING PERSONAL cuts to the core of where we find ourselves today. With a foreword by @imaniperry &amp; an afterword by @esglaude, keep your eye out for it this May! 👀 #BlackHistoryMonth  https://t.co/mP0aYown0o https://t.co/SHcL0577y8')

But was that tweet itself a reply to another tweet?

In [29]:
quoted_tweet['in_reply_to_status_id']

When this field holds an ID (it is None if the tweet is not a reply), we can use that ID to look up the original tweet:

In [57]:
original_tweet = api.statuses_lookup([quoted_tweet['in_reply_to_status_id']])[0]._json

In [58]:
original_tweet

{'created_at': 'Wed Feb 24 15:42:46 +0000 2021',
 'id': 1364601671146369030,
 'id_str': '1364601671146369030',
 'text': 'A critical take on my pieces on Fox News. https://t.co/tMLHm3tEnk',
 'truncated': False,
 'entities': {'hashtags': [],
  'symbols': [],
  'user_mentions': [],
  'urls': [{'url': 'https://t.co/tMLHm3tEnk',
    'expanded_url': 'https://twitter.com/brithume/status/1364589946808451072',
    'display_url': 'twitter.com/brithume/statu…',
    'indices': [42, 65]}]},
 'source': '<a href="https://about.twitter.com/products/tweetdeck" rel="nofollow">TweetDeck</a>',
 'in_reply_to_status_id': None,
 'in_reply_to_status_id_str': None,
 'in_reply_to_user_id': None,
 'in_reply_to_user_id_str': None,
 'in_reply_to_screen_name': None,
 'user': {'id': 17004618,
  'id_str': '17004618',
  'name': 'Nicholas Kristof',
  'screen_name': 'NickKristof',
  'location': 'Everywhere',
  'description': "Oregon farmboy turned NY Times columnist, author with my wife, @WuDunn, of Tightrope & Half the

And which tweet was he quoting?

In [62]:
original_tweet['quoted_status']

{'created_at': 'Wed Feb 24 14:56:10 +0000 2021',
 'id': 1364589946808451072,
 'id_str': '1364589946808451072',
 'text': 'What’s particularly sickening is that some journalists are cheering this stuff on, while nearly all the others rema… https://t.co/lQHpqIIlnF',
 'truncated': True,
 'entities': {'hashtags': [],
  'symbols': [],
  'user_mentions': [],
  'urls': [{'url': 'https://t.co/lQHpqIIlnF',
    'expanded_url': 'https://twitter.com/i/web/status/1364589946808451072',
    'display_url': 'twitter.com/i/web/status/1…',
    'indices': [117, 140]}]},
 'source': '<a href="http://twitter.com/#!/download/ipad" rel="nofollow">Twitter for iPad</a>',
 'in_reply_to_status_id': None,
 'in_reply_to_status_id_str': None,
 'in_reply_to_user_id': None,
 'in_reply_to_user_id_str': None,
 'in_reply_to_screen_name': None,
 'user': {'id': 112047805,
  'id_str': '112047805',
  'name': 'Brit Hume',
  'screen_name': 'brithume',
  'location': 'Southwest Florida',
  'description': 'Sr. Political Analyst, Fo
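The fields used above (`retweeted_status`, `quoted_status`, `in_reply_to_status_id`) are enough to tell the different kinds of tweet apart. Here is a rough sketch of that check. The function names and the order in which the cases are tested are my own convention, not part of tweepy.

```python
def tweet_kind(tweet_json):
    """Label a tweet dict as 'retweet', 'quote', 'reply', or 'original'."""
    if "retweeted_status" in tweet_json:
        return "retweet"
    if "quoted_status" in tweet_json:
        return "quote"
    if tweet_json.get("in_reply_to_status_id") is not None:
        return "reply"
    return "original"


def embedded_tweet(tweet_json):
    """Return the retweeted or quoted tweet carried inside this one, if any."""
    return tweet_json.get("retweeted_status") or tweet_json.get("quoted_status")
```

A reply does not carry the tweet it answers, only its ID, which is why a separate lookup is needed in that case.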

And so on and so forth. Another important component of a tweet is its "entities" field:

In [30]:
quoted_tweet['entities']

{'hashtags': [{'text': 'JamesBaldwin', 'indices': [0, 13]},
  {'text': 'BlackHistoryMonth', 'indices': [228, 246]}],
 'symbols': [],
 'user_mentions': [{'screen_name': 'imaniperry',
   'name': 'Imani Perry',
   'id': 23410776,
   'id_str': '23410776',
   'indices': [146, 157]},
  {'screen_name': 'esglaude',
   'name': 'Eddie S. Glaude Jr.',
   'id': 180535644,
   'id_str': '180535644',
   'indices': [180, 189]}],
 'urls': [{'url': 'https://t.co/mP0aYown0o',
   'expanded_url': 'https://buff.ly/3dMGLrq',
   'display_url': 'buff.ly/3dMGLrq',
   'indices': [248, 271]}],
 'media': [{'id': 1364996339516473350,
   'id_str': '1364996339516473350',
   'indices': [272, 295],
   'media_url': 'http://pbs.twimg.com/media/EvFw8FCWgAYwXNn.jpg',
   'media_url_https': 'https://pbs.twimg.com/media/EvFw8FCWgAYwXNn.jpg',
   'url': 'https://t.co/SHcL0577y8',
   'display_url': 'pic.twitter.com/SHcL0577y8',
   'expanded_url': 'https://twitter.com/BeaconPressBks/status/1364996340976025602/photo/1',
   'type':

It contains the hashtags, symbols, URLs, and media in the tweet, along with any users it mentions.
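For text analysis, the nested "entities" dictionary is usually easier to work with once it is flattened into simple lists. A small sketch, using a trimmed copy of the output above as sample data (`summarize_entities` is a hypothetical helper name):

```python
def summarize_entities(entities):
    """Pull hashtags, mentioned screen names, and expanded URLs into lists."""
    return {
        "hashtags": [h["text"] for h in entities.get("hashtags", [])],
        "mentions": [m["screen_name"] for m in entities.get("user_mentions", [])],
        "urls": [u["expanded_url"] for u in entities.get("urls", [])],
    }


# Trimmed copy of the entities shown above
sample_entities = {
    "hashtags": [{"text": "JamesBaldwin", "indices": [0, 13]},
                 {"text": "BlackHistoryMonth", "indices": [228, 246]}],
    "symbols": [],
    "user_mentions": [{"screen_name": "imaniperry", "indices": [146, 157]},
                      {"screen_name": "esglaude", "indices": [180, 189]}],
    "urls": [{"url": "https://t.co/mP0aYown0o",
              "expanded_url": "https://buff.ly/3dMGLrq",
              "indices": [248, 271]}],
}

summary = summarize_entities(sample_entities)
print(summary["hashtags"])  # ['JamesBaldwin', 'BlackHistoryMonth']
```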

### Extracting by User

The "user_timeline" method in tweepy retrieves the tweets of a single Twitter account.

In [32]:
def extract_by_user(screen_name):
    """Return up to 200 of the most recent tweets posted by `screen_name`."""
    tweets = api.user_timeline(screen_name=screen_name,
                           # 200 is the maximum allowed count
                           count=200,
                           # Leave out retweets
                           include_rts = False,
                           # Necessary to keep full_text
                           # otherwise the text is truncated to 140 characters
                           tweet_mode = 'extended'
                           )
    return tweets

In [33]:
elonmusk_tweets = extract_by_user("elonmusk")
elon_tweets = [x._json for x in elonmusk_tweets]
elon_df = pd.DataFrame(elon_tweets)
elon_df.head()

| | created_at | id | id_str | full_text | truncated | display_text_range | entities | extended_entities | source | in_reply_to_status_id | ... | retweet_count | favorite_count | favorited | retweeted | possibly_sensitive | lang | quoted_status_id | quoted_status_id_str | quoted_status_permalink | quoted_status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Thu Feb 25 22:08:03 +0000 2021 | 1365061022084698118 | 1365061022084698118 | 🙏 https://t.co/BBwnTndvoi | False | [0, 1] | `{'hashtags': [], 'symbols': [], 'user_mentions...` | `{'media': [{'id': 1365061018829922305, 'id_str...` | `<a href="http://twitter.com/download/iphone" r...` | | ... | 5406 | 66354 | False | False | False | und | | | | |
| 1 | Thu Feb 25 21:50:26 +0000 2021 | 1365056586096459786 | 1365056586096459786 | @chicago_glenn @RationalEtienne @skorusARK We ... | False | [43, 126] | `{'hashtags': [], 'symbols': [], 'user_mentions...` | | `<a href="http://twitter.com/download/iphone" r...` | 1.365056e+18 | ... | 142 | 2026 | False | False | | en | | | | |
| 2 | Thu Feb 25 21:47:26 +0000 2021 | 1365055830085763081 | 1365055830085763081 | @RationalEtienne @skorusARK Nickel is our bigg... | False | [28, 195] | `{'hashtags': [], 'symbols': [], 'user_mentions...` | | `<a href="http://twitter.com/download/iphone" r...` | 1.365054e+18 | ... | 167 | 1621 | False | False | | en | | | | |
| 3 | Thu Feb 25 21:44:01 +0000 2021 | 1365054971146829824 | 1365054971146829824 | @harsimranbansal @skorusARK Absolutely | False | [28, 38] | `{'hashtags': [], 'symbols': [], 'user_mentions...` | | `<a href="http://twitter.com/download/iphone" r...` | 1.365054e+18 | ... | 40 | 826 | False | False | | en | | | | |
| 4 | Thu Feb 25 21:36:35 +0000 2021 | 1365053102794153987 | 1365053102794153987 | @MemesOfMars @skorusARK Fremont shut down for ... | False | [24, 98] | `{'hashtags': [], 'symbols': [], 'user_mentions...` | | `<a href="http://twitter.com/download/iphone" r...` | 1.365053e+18 | ... | 240 | 1691 | False | False | | en | | | | |


In [45]:
elon_df['full_text'][2]

'@RationalEtienne @skorusARK Nickel is our biggest concern for scaling lithium-ion cell production. That’s why we are shifting standard range cars to an iron cathode. Plenty of iron (and lithium)!'
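Replies like this one begin with the screen names of the people being answered, which is usually noise for text analysis. The `display_text_range` column marks where the visible text starts and ends, so slicing `full_text` with it drops the leading mentions. This sketch assumes the range counts Unicode code points, which is how Python indexes strings. `visible_text` is a name I chose for illustration.

```python
def visible_text(tweet_json):
    """Return full_text restricted to display_text_range (whole text if absent)."""
    start, end = tweet_json.get("display_text_range", [0, None])
    return tweet_json["full_text"][start:end]


# The reply shown above, with its display_text_range from the table
sample_reply = {
    "full_text": "@RationalEtienne @skorusARK Nickel is our biggest concern "
                 "for scaling lithium-ion cell production.",
    "display_text_range": [28, 195],
}
print(visible_text(sample_reply))
```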

Twitter returns at most __200 tweets per request__.

What if we need more tweets than that?

One clever trick:
1. Do an initial scrape.
2. Find the tweet ID of the oldest tweet.
3. Subtract 1 from that ID and pass the result as "max_id" in the next request, so that only older tweets come back.
4. Repeat until no more tweets are returned.


In [46]:
import time

oldest_id = elonmusk_tweets[-1].id

# Copy the first batch so that extending it does not modify elonmusk_tweets
all_tweets = list(elonmusk_tweets)
while True:
    tweets = api.user_timeline(screen_name="elonmusk",
                           # 200 is the maximum allowed count
                           count=200,
                           include_rts = False,
                           # Only return tweets older than the oldest one we have
                           max_id = oldest_id - 1,
                           # Necessary to keep full_text
                           # otherwise the text is truncated to 140 characters
                           tweet_mode = 'extended'
                           )
    # Stop once Twitter has no older tweets to return
    if len(tweets) == 0:
        break
    oldest_id = tweets[-1].id
    all_tweets.extend(tweets)
    # Pause between requests to stay clear of the rate limit
    time.sleep(1)
    print('N of tweets downloaded till now {} \r'.format(len(all_tweets)), end='',flush=True)

N of tweets downloaded till now 3038 
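The "max_id" trick does not depend on tweepy itself, so it can be tried offline. The sketch below separates the paging logic from the network call. `fetch_all`, `fake_fetch`, and the tiny list of IDs are all made up for illustration, and the fake function merely imitates an endpoint that returns tweets newest first.

```python
def fetch_all(fetch_page, max_pages=20):
    """Collect pages until one comes back empty.

    fetch_page(max_id) must return a list of dicts with an "id" key,
    newest first, containing only tweets with id <= max_id
    (or the newest tweets when max_id is None).
    """
    collected = []
    max_id = None
    for _ in range(max_pages):
        page = fetch_page(max_id)
        if not page:
            break
        collected.extend(page)
        # Ask only for tweets strictly older than the oldest one seen so far
        max_id = page[-1]["id"] - 1
    return collected


# Offline stand-in for the API: ten "tweets" with IDs 10 down to 1
FAKE_IDS = list(range(10, 0, -1))


def fake_fetch(max_id, page_size=4):
    eligible = [i for i in FAKE_IDS if max_id is None or i <= max_id]
    return [{"id": i} for i in eligible[:page_size]]


tweets = fetch_all(fake_fetch)
print([t["id"] for t in tweets])  # [10, 9, 8, 7, 6, 5, 4, 3, 2, 1]
```

The `max_pages` cap is a safety net so that a misbehaving fetch function cannot keep the loop running forever.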

To write the tweets to a file, it is best to use the JSON format.

In [47]:
tweets_json = [x._json for x in all_tweets]

import json

with open("./Elon_tweets.json", 'w') as outfile:
    json.dump(tweets_json,outfile)
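Reading the file back later is the mirror image of writing it. A minimal sketch, assuming the file holds a list of tweet dictionaries with `id_str` and `full_text` keys, as written above (`load_tweet_texts` is a hypothetical helper name):

```python
import json


def load_tweet_texts(path):
    """Load a JSON list of tweet dicts and return (id_str, full_text) pairs."""
    with open(path, encoding="utf-8") as infile:
        tweets = json.load(infile)
    return [(t["id_str"], t["full_text"]) for t in tweets]
```

For example, `load_tweet_texts("./Elon_tweets.json")` would give one pair per saved tweet.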

### When do we use this?
Use this approach when you want to study the tweets of a specific user (like Elon Musk). It is quite simple to do, but it only helps when you already know whose tweets you need.




But what about the rest of Twitter?

## Streaming

Streaming is for when you want to download tweets in real time, as they are posted, and keep only those that contain certain phrases.

It is useful when you want to collect from all of Twitter rather than from specific accounts, restricted by those search parameters.

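In [None]:
# This is a shell command, so it needs a leading ! to run from a notebook cell.
# It redirects whatever twitter_stream.py prints into tweets.txt.
!python twitter_stream.py > tweets.txt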
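The streaming script itself is not shown here, so the layout of `tweets.txt` is unknown. If we assume it holds one JSON-encoded tweet per line, the saved stream could be filtered for phrases afterwards along these lines. This is a sketch under that assumption, and `matching_tweets` is a hypothetical helper.

```python
import json


def matching_tweets(lines, phrases):
    """Yield tweet dicts whose text contains any of the phrases (case-insensitive)."""
    wanted = [p.lower() for p in phrases]
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            tweet = json.loads(line)
        except json.JSONDecodeError:
            # Skip lines that are not valid JSON, such as keep-alive blanks or logs
            continue
        if not isinstance(tweet, dict):
            continue
        text = (tweet.get("full_text") or tweet.get("text") or "").lower()
        if any(p in text for p in wanted):
            yield tweet
```

Because it accepts any iterable of lines, it can be used as `with open("tweets.txt") as f: hits = list(matching_tweets(f, ["nickel"]))`.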